1 Introduction

A fundamental tool for finding a root of a monotone operator is the proximal point method [46, 59]. Monotone operator theory is of particular interest since it is closely related to convex functions and convex minimization [7, 14, 60]. For example, the proximal point method is useful when solving ill-conditioned problems or dual problems. In particular, the augmented Lagrangian method (i.e., the method of multipliers) [36, 56] and the alternating direction method of multipliers (ADMM) [26, 27] are instances of the proximal point method applied to dual problems [23, 24, 58].

To improve the efficiency of the proximal point method, accelerating its worst-case rate has been of interest both in theory and in applications (see, e.g., [1, 3, 6, 15, 29, 34, 43]). Specifically, inspired by Nesterov’s fast gradient method [51, 52], Güler [34] accelerated the worst-case rate of the proximal point method for convex minimization with respect to the cost function. This yields the fast \(O(1/i^2)\) rate, where i denotes the number of iterations, compared to the O(1/i) rate of the proximal point method. However, this acceleration has not been theoretically generalized to the monotone inclusion problem, and only empirical accelerations, e.g., via relaxation and inertia (i.e., an implicit version of the heavy ball method [55], or equivalently, Nesterov’s and Güler’s accelerating technique [34, 51, 52]) in [1, 3, 6, 15, 29], have been studied. Therefore, this paper studies accelerating the worst-case rate of the proximal point method with respect to the fixed-point residual for maximally monotone operators. This provides the fast \(O(1/i^2)\) rate, which improves upon the O(1/i) rate of the proximal point method [9, 32]. The proof is computer-assisted via the performance estimation problem (PEP) approach [21] and its extensions [20, 22, 31, 32, 37, 38, 39, 40, 41, 61, 65, 66, 67].

Under the additional strong monotonicity condition, the proximal point method has a linear rate in terms of the fixed-point residual [59], while the proposed acceleration is not guaranteed to have such a linear rate. Therefore, this paper further employs a restarting technique (e.g., [50, Section 11.4], [53, Section 5.1]) under the strong monotonicity condition. This has a linear rate, and is faster than the proximal point method for some practical cases.

The proposed acceleration of the proximal point method has wide applications. It provides an acceleration to the proximal method of multipliers [58], the Douglas–Rachford splitting method [19, 44], and ADMM [26, 27]. The proposed result also applies to preconditioned proximal point methods such as the primal–dual hybrid gradient (PDHG) method [11, 13, 25, 35] (i.e., a preconditioned ADMM), yielding an accelerated PDHG method. This paper then shows that the proposed acceleration applies to a forward method for cocoercive operators. Existing works on accelerating the forward method can be found, for example, in [2, 45].

Section 2 reviews maximally monotone operators, the proximal point method, and its known accelerations. Section 3 studies the PEP with respect to the fixed-point residual for monotone inclusion problems. Section 4 proposes a new accelerated proximal point method using the PEP. Section 5 considers a restarting technique that yields a linear rate under the additional strong monotonicity assumption. Section 6 applies the proposed acceleration to well-known instances of the proximal point method, such as the proximal method of multipliers, the PDHG method, the Douglas–Rachford splitting method, and ADMM. Section 6 also provides numerical experiments. Section 7 shows that the proposed approach also accelerates the forward method for cocoercive operators, and Sect. 8 concludes.

2 Problem and method

2.1 Monotone inclusion problem

Let \({\mathcal {H}}\) be a real Hilbert space equipped with inner product \(\mathop {\langle \cdot ,\, \cdot \rangle }\nolimits \), and associated norm \(||\cdot ||\). A set-valued operator \({\varvec{M}} \;:\;{\mathcal {H}}\rightarrow 2^{\mathcal {H}}\) is monotone if

$$\begin{aligned} \mathop {\langle {\varvec{x}}- {\varvec{y}},\, {\varvec{u}}- {\varvec{v}} \rangle }\nolimits \ge 0 \text { for all } ({\varvec{x}},{\varvec{u}}),({\varvec{y}},{\varvec{v}})\in {\text {gra}}{\varvec{M}} , \end{aligned}$$
(1)

where \({\text {gra}}{\varvec{M}} {:}{=} \{({\varvec{x}},{\varvec{u}})\in {\mathcal {H}}\times {\mathcal {H}}\;:\,{\varvec{u}} \in {\varvec{M}} {\varvec{x}} \}\) denotes the graph of \({\varvec{M}} \). A monotone operator \({\varvec{M}} \) is maximally monotone if there exists no monotone operator \({\varvec{A}} \;:\;{\mathcal {H}}\rightarrow 2^{\mathcal {H}}\) such that \({\text {gra}}{\varvec{A}} \) properly contains \({\text {gra}}{\varvec{M}} \). Let \({\mathcal {M}}({\mathcal {H}})\) be the class of maximally monotone operators on \({\mathcal {H}}\). In addition, a set-valued operator \({\varvec{M}} \;:\;{\mathcal {H}}\rightarrow 2^{\mathcal {H}}\) is \(\mu \)-strongly monotone for \(\mu \in \mathbb {R} _{++}\), if

$$\begin{aligned} \mathop {\langle {\varvec{x}}- {\varvec{y}},\, {\varvec{u}}- {\varvec{v}} \rangle }\nolimits \ge \mu ||{\varvec{x}}- {\varvec{y}} ||^2 \text { for all } ({\varvec{x}},{\varvec{u}}),({\varvec{y}},{\varvec{v}})\in {\text {gra}}{\varvec{M}} . \end{aligned}$$
(2)

Let \({\mathcal {M}}_{\mu }({\mathcal {H}})\) be the class of maximally and \(\mu \)-strongly monotone operators on \({\mathcal {H}}\). Also, define \({\mathcal {B}}({\mathcal {H}},{\mathcal {G}}) = \{{\varvec{L}} \;:\;{\mathcal {H}}\rightarrow {\mathcal {G}}\;|\;{\varvec{L}} \text { is linear and bounded}\}\) for a real Hilbert space \({\mathcal {G}}\) equipped with inner product \(\mathop {\langle \cdot ,\, \cdot \rangle }\nolimits \), and let \({\varvec{L}} ^*\in {\mathcal {B}}({\mathcal {G}},{\mathcal {H}})\) be the adjoint of \({\varvec{L}} \in {\mathcal {B}}({\mathcal {H}},{\mathcal {G}})\) that satisfies \(\mathop {\langle {\varvec{L}} {\varvec{x}},\, {\varvec{y}} \rangle }\nolimits = \mathop {\langle {\varvec{x}},\, {\varvec{L}} ^*{\varvec{y}} \rangle }\nolimits \) for all \({\varvec{x}} \in {\mathcal {H}}\) and \({\varvec{y}} \in {\mathcal {G}}\).

This paper considers the monotone inclusion problem:

$$\begin{aligned} {\text {Find}}\;\; {\varvec{x}} \in {\mathcal {H}}\quad \text {subject to} \quad {\varvec{0}} \in {\varvec{M}} {\varvec{x}} , \end{aligned}$$
(3)

where \({\varvec{M}} \in {\mathcal {M}}({\mathcal {H}})\) (or \({\varvec{M}} \in {\mathcal {M}}_{\mu }({\mathcal {H}})\)). This includes convex problems and convex–concave problems; a subdifferential \(\partial f\) of a closed proper convex function \(f\;:\;{\mathcal {H}}\rightarrow \mathbb {R} \cup \{\infty \}\) is maximally monotone [48]. Let \({\mathcal {F}}({\mathcal {H}})\) be the class of closed proper convex functions on \({\mathcal {H}}\).

We assume that the optimal set \(X_*({\varvec{M}}){:}{=} \{{\varvec{x}} \in {\mathcal {H}}\;:\; {\varvec{0}} \in {\varvec{M}} {\varvec{x}} \}\) is nonempty. We also assume that the distance between an initial point \({\varvec{x}} _0\) and some optimal point \({\varvec{x}} _*\in X_*({\varvec{M}})\) is bounded as

$$\begin{aligned} ||{\varvec{x}} _0 - {\varvec{x}} _*|| \le R \quad \text {for a constant } R>0 . \end{aligned}$$
(4)

2.2 Proximal point method and its worst-case rates

The proximal point method was first introduced to convex optimization by Martinet [46], building on the proximal mapping of Moreau [49]. The method was later extended to the monotone inclusion problem by Rockafellar [59]. The proximal point method for maximally monotone operators includes the augmented Lagrangian method [36, 56], the proximal method of multipliers [58], the Douglas–Rachford splitting method [19, 44], and the alternating direction method of multipliers (ADMM) [26, 27], so studying its worst-case convergence behavior and acceleration, the main interest of this paper, is important.

The proximal mapping [49] (or the resolvent operator) of an operator \({\varvec{M}} \) is defined as

$$\begin{aligned} {\varvec{J}} _{{\varvec{M}}} := ({\varvec{I}} + {\varvec{M}})^{-1} , \end{aligned}$$
(5)

where \({\varvec{I}} \;:\;{\mathcal {H}}\rightarrow {\mathcal {H}}\) is an identity operator, i.e., \({\varvec{I}} ({\varvec{x}})={\varvec{x}} \) for all \({\varvec{x}} \in {\mathcal {H}}\). The resolvent operator \({\varvec{J}} _{{\varvec{M}}}\) is single-valued and firmly nonexpansive for \({\varvec{M}} \in {\mathcal {M}}({\mathcal {H}})\) [47]. The proximal point method [46, 59] generates a sequence \(\{{\varvec{x}} _i\}\) by iteratively applying the resolvent operator with a positive real number \(\lambda \) as below.

[Algorithm: Proximal point method]
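As a minimal illustration (not part of the paper), the iteration in the box above can be sketched in Python with a user-supplied resolvent:

```python
import numpy as np

def proximal_point_method(resolvent, x0, num_iters):
    """Minimal sketch of the proximal point method: x_{i+1} = J_{lam M}(x_i),
    where `resolvent` evaluates the resolvent J_{lam M} = (I + lam M)^{-1}."""
    x = np.asarray(x0, dtype=float)
    residuals = []
    for _ in range(num_iters):
        x_next = resolvent(x)
        residuals.append(np.linalg.norm(x_next - x) ** 2)  # fixed-point residual (6)
        x = x_next
    return x, residuals
```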

In [9, Proposition 8], the worst-case rate of the proximal point method with respect to the fixed-point residual

$$\begin{aligned} ||{\varvec{x}}- {\varvec{J}} _{\lambda {\varvec{M}}}({\varvec{x}})||^2 \end{aligned}$$
(6)

was found to satisfy

$$\begin{aligned} ||{\varvec{x}} _i - {\varvec{x}} _{i-1}||^2 \le \frac{R^2}{i} \end{aligned}$$
(7)

for \(i\ge 1\). Very recently in [32], this was improved to

$$\begin{aligned} ||{\varvec{x}} _i - {\varvec{x}} _{i-1}||^2 \le \left( 1 - \frac{1}{i}\right) ^{i-1}\frac{R^2}{i} ,\end{aligned}$$
(8)

which is exact when \(\dim {\mathcal {H}}\ge 2\). Such an exact worst-case example with \(\dim {\mathcal {H}}= 2\) given in [32] will be revisited at the end of Sect. 4. The bound (8) is asymptotically e-times smaller than (7), where e is Euler’s number. When we additionally assume \(\mu \)-strong monotonicity, the proximal point method has a linear rate [7, Example 23.40] [59]

$$\begin{aligned} ||{\varvec{x}} _{i+1} - {\varvec{x}} _i||^2 \le \left( \frac{1}{1+\lambda \mu }\right) ^2||{\varvec{x}} _i - {\varvec{x}} _{i-1}||^2 \end{aligned}$$
(9)

for \(i\ge 1\), which is exact considering the case \({\varvec{M}} {\varvec{x}} = \mu {\varvec{x}} \) with \(\dim {\mathcal {H}}= 1\).

For a convex minimization of \(f\in {\mathcal {F}}({\mathcal {H}})\), [66, Conjecture 4.2] conjectures that the proximal point method satisfies

$$\begin{aligned} ||{\varvec{x}} _i - {\varvec{x}} _{i-1}||^2 \le \frac{R^2}{i^2} \end{aligned}$$
(10)

for \(i\ge 1\), which is faster than (8) for maximally monotone operators. In addition, the O(1/i) worst-case rate of the proximal point method with respect to the cost function was studied in [33, Theorem 2.1], and this was improved by a factor of 2 in [66, Theorem 4.1]

$$\begin{aligned} f({\varvec{x}} _i) - f({\varvec{x}} _*) \le \frac{R^2}{4\lambda i} \end{aligned}$$
(11)

for \(i\ge 1\) and some \({\varvec{x}} _* \in X_*(\partial f)\) with \(||{\varvec{x}} _0 - {\varvec{x}} _*|| \le R\).

Remark 2.1

The results for the proximal point method can be applied to a preconditioned proximal point method. Let \({\varvec{L}} \in {\mathcal {B}}({\mathcal {H}},{\mathcal {H}})\) be invertible. Then, \({\varvec{L}} ^*{\varvec{M}} {\varvec{L}} \) is maximally monotone for \({\varvec{M}} \in {\mathcal {M}}({\mathcal {H}})\) [7, Proposition 23.25], and the corresponding proximal point method is

$$\begin{aligned} \tilde{{\varvec{x}}}_{i+1} = {\varvec{J}} _{\lambda {\varvec{L}} ^*{\varvec{M}} {\varvec{L}}}(\tilde{{\varvec{x}}}_i) = ({\varvec{I}} + \lambda {\varvec{L}} ^*{\varvec{M}} {\varvec{L}})^{-1}\tilde{{\varvec{x}}}_i .\end{aligned}$$
(12)

Introducing \({\varvec{x}} _i = {\varvec{L}} \tilde{{\varvec{x}}}_i\) and \({\varvec{P}} = ({\varvec{L}} {\varvec{L}} ^*)^{-1}\) yields the following equivalent preconditioned proximal point method

$$\begin{aligned} {\varvec{x}} _{i+1} = ({\varvec{P}} + \lambda {\varvec{M}})^{-1}{\varvec{P}} {\varvec{x}} _i .\end{aligned}$$
(13)

So, for example, the inequality (7) leads to the preconditioned fixed-point residual bound for the preconditioned proximal point method

$$\begin{aligned} \mathop {\langle {\varvec{P}} ({\varvec{x}} _i - {\varvec{x}} _{i-1}),\, {\varvec{x}} _i - {\varvec{x}} _{i-1} \rangle }\nolimits \le \frac{R^2}{i} \end{aligned}$$
(14)

for \(i\ge 1\), and for some \({\varvec{x}} _*\in X_*({\varvec{M}})\) with \(\mathop {\langle {\varvec{P}} ({\varvec{x}} _0 - {\varvec{x}} _*),\, {\varvec{x}} _0 - {\varvec{x}} _* \rangle }\nolimits \le R^2\). This is particularly useful when considering the PDHG method [11, 13, 25, 35, Lemma 2.2], which is an instance of a preconditioned proximal point method. We will revisit this in Sect. 6.2.
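As a small numerical illustration (not from the paper), the equivalence between (12) and (13) under \({\varvec{x}} _i = {\varvec{L}} \tilde{{\varvec{x}}}_i\) can be checked for a random linear monotone operator; the sizes and the specific operator below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 5, 1.0
L = rng.standard_normal((d, d)) + d * np.eye(d)   # invertible almost surely
S = rng.standard_normal((d, d))
M = S - S.T + np.eye(d)                           # linear monotone operator (symmetric part = I)
P = np.linalg.inv(L @ L.T)                        # P = (L L^*)^{-1}

xt = rng.standard_normal(d)                       # tilde{x}_0
x = L @ xt                                        # x_0 = L tilde{x}_0
for _ in range(10):
    xt = np.linalg.solve(np.eye(d) + lam * L.T @ M @ L, xt)   # update (12)
    x = np.linalg.solve(P + lam * M, P @ x)                   # update (13)
print(np.allclose(L @ xt, x))                     # x_i = L tilde{x}_i, as claimed
```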

2.3 Existing accelerations for proximal point method

This section reviews existing accelerations of the proximal point method for convex minimization with respect to the cost function. To the best of our knowledge, no other type of proximal point method has a guaranteed accelerated worst-case rate.

For convex minimization, Güler [34] developed the following two accelerated versions, inspired by Nesterov’s fast gradient method [51, 52]. The following is the first accelerated version of the proximal point method in [34], which is an instance of FISTA [8]. The original version in [34] includes a variation with an iteration-dependent \(\lambda _i\), rather than a fixed constant \(\lambda \) (see also [5] for choosing \(\lambda _i\) appropriately for further acceleration). This paper focuses on a fixed constant \(\lambda \), and we leave the extension to a varying \(\lambda _i\) as future work.

[Algorithm: Güler's first accelerated proximal point method]
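For illustration only, the box above can be sketched in the standard FISTA form [8] with the resolvent in place of the proximal gradient step; the \(t_i\) recursion and indexing below follow FISTA and may differ in minor details from the paper's algorithm box:

```python
import numpy as np

def guler_first_accelerated_ppm(prox, x0, num_iters):
    """Hedged FISTA-type sketch with a fixed lambda:
        x_{i+1} = J_{lam ∂f}(y_i) = prox_{lam f}(y_i)
        t_{i+1} = (1 + sqrt(1 + 4 t_i^2)) / 2
        y_{i+1} = x_{i+1} + (t_i - 1)/t_{i+1} * (x_{i+1} - x_i)."""
    x = y = np.asarray(x0, dtype=float)
    t = 1.0
    for _ in range(num_iters):
        x_next = prox(y)
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_next + (t - 1) / t_next * (x_next - x)   # inertia (no correction term)
        x, t = x_next, t_next
    return x
```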

The sequence generated by Güler’s first accelerated proximal point method satisfies [34, Theorem 2.3], [8, Theorem 4.4]

$$\begin{aligned} f({\varvec{x}} _i) - f({\varvec{x}} _*) \le \frac{R^2}{2\lambda t_{i-1}^2} \le \frac{2R^2}{\lambda (i+1)^2} \end{aligned}$$
(15)

for \(i\ge 1\) and for some \({\varvec{x}} _* \in X_*(\partial f)\) with \(||{\varvec{x}} _0 - {\varvec{x}} _*|| \le R\). The following is another accelerated proximal point method by Güler [34], whose formulation is similar to that of the optimized gradient methods [37, 39, 40].

[Algorithm: Güler's second accelerated proximal point method]

The sequence generated by Güler’s second accelerated proximal point method satisfies [34, Theorem 6.1] for \(i\ge 1\)

$$\begin{aligned} f({\varvec{x}} _i) - f({\varvec{x}} _*) \le \frac{R^2}{4\lambda t_{i-1}^2} \le \frac{R^2}{\lambda (i+1)^2}, \end{aligned}$$
(16)

which is smaller than (15) by a factor of two.

2.4 Main contribution

To accelerate the worst-case rate of the proximal point method for maximally monotone operators, the relaxation and the inertia (i.e., an implicit version of the heavy ball method [55], or equivalently, Nesterov’s and Güler’s accelerating technique [34, 51, 52]) have been studied in [1, 3, 6, 15, 29]. However, none of them guarantee accelerated rates. Therefore, the main contribution of this paper is to develop a method that has a fast \(O(1/i^2)\) rate with respect to the fixed-point residual, improving upon the O(1/i) rate of the proximal point method in (7) and (8).

This paper considers the following general proximal point method with step coefficients \(\{h_{i+1,k+1}\}_{k=0}^i\) for reusing previous and current updates \(\{{\varvec{x}} _{k+1} - {\varvec{y}} _k\}_{k=0}^i\). This includes the proximal point method, the accelerated methods via the relaxation and the inertia [1, 3, 6, 15, 29], and the proposed accelerated method.

[Algorithm: General proximal point method]
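As an illustration (not part of the paper), the box above can be sketched as follows; the dictionary `h` of step coefficients is an illustrative data structure, not the paper's notation:

```python
import numpy as np

def general_ppm(resolvent, y0, h, num_iters):
    """Sketch of the general proximal point method:
        x_{i+1} = J_{lam M}(y_i),
        y_{i+1} = y_i + sum_{k=0}^{i} h_{i+1,k+1} (x_{k+1} - y_k),
    where h is a dict keyed by 1-based (i, k) as in the paper.
    Example: the plain proximal point method uses h = {(i, i): 1.0 for all i}."""
    y = [np.asarray(y0, dtype=float)]        # y[0] = y_0
    x = [None]                               # x is 1-indexed: x[1], x[2], ...
    for i in range(num_iters):
        x.append(resolvent(y[i]))            # x_{i+1}
        y.append(y[i] + sum(h.get((i + 1, k + 1), 0.0) * (x[k + 1] - y[k])
                            for k in range(i + 1)))
    return x[1:], y
```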

This paper next uses the PEP approach to find the choice of \(\{h_{i+1,k+1}\}_{k=0}^i\) that guarantees an accelerated rate. While the formulation of the general proximal point method is inefficient in general, the proposed accelerated method with the specific choice of \(\{h_{i+1,k+1}\}_{k=0}^i\) found by PEP has an efficient equivalent form. This form is similar to the other accelerated methods with the relaxation and/or the inertia [1, 3, 6, 15, 29].

3 Performance estimation problem for maximally monotone operators

This section uses the performance estimation problem (PEP) approach [21, 66, 67] to analyze the general proximal point method for maximally monotone operators, in terms of the fixed-point residual (6). This was recently studied in [32] for the proximal point method, providing the exact rate (8). The same authors [31] also used the PEP to study the exact worst-case rate for the ergodic sequence of the (relaxed) proximal point method for the variational inequalities. Similarly, [66] used PEP to analyze the worst-case rate of the proximal point method for convex minimization in terms of the fixed-point residual and the cost function, yielding (10) and (11), respectively.

Building upon [21, 31, 32, 66, 67], the worst-case rate of the general proximal point method after N iterations for decreasing the fixed-point residual (6) under the initial distance condition (4) can be computed by

$$\begin{aligned} \max _{{\varvec{M}} \in {\mathcal {M}}({\mathcal {H}})}\max _{\begin{array}{c} {\varvec{x}} _1,\ldots ,{\varvec{x}} _N\in {\mathcal {H}}, \\ {\varvec{y}} _0,\ldots ,{\varvec{y}} _{N-1}\in {\mathcal {H}}, \\ {\varvec{x}} _*\in X_*({\varvec{M}}) \end{array}}\;&\frac{1}{R^2}||{\varvec{x}} _N - {\varvec{y}} _{N-1}||^2 \nonumber \\ \text {subject to }\;&{\varvec{x}} _{i+1} = {\varvec{J}} _{\lambda {\varvec{M}}}({\varvec{y}} _i), \quad i=0,\ldots ,N-1, \nonumber \\&{\varvec{y}} _{i+1} = {\varvec{y}} _i + \sum _{k=0}^i h_{i+1,k+1}({\varvec{x}} _{k+1} - {\varvec{y}} _k), \quad i=0,\ldots ,N-2, \nonumber \\&||{\varvec{y}} _0 - {\varvec{x}} _*||^2 \le R^2. \end{aligned}$$
(17)

This is an infinite-dimensional problem due to the constraint \({\varvec{M}} \in {\mathcal {M}}({\mathcal {H}})\), which is impractical to solve. The PEP in [21] further introduced a series of steps that reformulate such an impractical problem into a tractable one, which we apply to (17) step by step below.

The first step is to reformulate the problem (17) into a finite-dimensional problem. [61, Fact 1] implies that one can replace \({\varvec{M}} \in {\mathcal {M}}({\mathcal {H}})\) in (17) by a set of inequality constraints (1) for \({\varvec{M}} \in {\mathcal {M}}({\mathcal {H}})\) on a finite number of pairs of points in \(\{{\varvec{x}} _1,\ldots ,{\varvec{x}} _N,{\varvec{x}} _*\}\) without strictly relaxing the problem (17). Specifically, such constraints are

$$\begin{aligned} \mathop {\langle {\varvec{x}} _i - {\varvec{x}} _j,\, {\varvec{q}} _i - {\varvec{q}} _j \rangle }\nolimits \ge 0, \end{aligned}$$
(18)

for all \(i,j\in \{1,\ldots ,N,*\}\), with additional variables \({\varvec{q}} _i\in {\varvec{M}} {\varvec{x}} _i\) for \(i=1,\ldots ,N\) and \({\varvec{q}} _* = {\varvec{0}} \in {\varvec{M}} {\varvec{x}} _*\). Then the resulting equivalent problem of (17) is

$$\begin{aligned} \max _{\begin{array}{c} {\varvec{x}} _1,\ldots ,{\varvec{x}} _N,{\varvec{x}} _*\in {\mathcal {H}}, \\ {\varvec{y}} _0,\ldots ,{\varvec{y}} _{N-1}\in {\mathcal {H}}, \\ {\varvec{q}} _1,\ldots ,{\varvec{q}} _N\in {\mathcal {H}} \end{array}}\;&\frac{1}{R^2}||{\varvec{x}} _N - {\varvec{y}} _{N-1}||^2 \nonumber \\ \text {subject to }\;&\mathop {\langle {\varvec{x}} _i - {\varvec{x}} _j,\, {\varvec{q}} _i - {\varvec{q}} _j \rangle }\nolimits \ge 0, \quad i<j=1,\ldots ,N, \nonumber \\&\mathop {\langle {\varvec{x}} _i - {\varvec{x}} _*,\, {\varvec{q}} _i \rangle }\nolimits \ge 0, \quad i=1,\ldots ,N, \nonumber \\&{\varvec{x}} _{i+1} = {\varvec{y}} _i - \lambda {\varvec{q}} _{i+1}, \quad i=0,\ldots ,N-1, \nonumber \\&{\varvec{y}} _{i+1} = {\varvec{y}} _i - \lambda \sum _{k=0}^i h_{i+1,k+1}{\varvec{q}} _{k+1}, \quad i=0,\ldots ,N-2, \nonumber \\&||{\varvec{y}} _0 - {\varvec{x}} _*||^2 \le R^2. \end{aligned}$$
(19)

Further removing \({\varvec{x}} _i\) and using the change of variables

$$\begin{aligned} {\varvec{g}} _i {:}{=} \frac{\lambda }{R}{\varvec{q}} _i, \quad i=1,\ldots ,N, \end{aligned}$$
(20)

simplify the problem (19) as

$$\begin{aligned} \max _{\begin{array}{c} {\varvec{y}} _0,\ldots ,{\varvec{y}} _{N-1},{\varvec{x}} _*\in {\mathcal {H}}, \\ {\varvec{g}} _1,\ldots ,{\varvec{g}} _N\in {\mathcal {H}} \end{array}}\;&||{\varvec{g}} _N||^2 \nonumber \\ \text {subject to }\;&\frac{1}{R}\mathop {\langle {\varvec{y}} _{i-1} - R{\varvec{g}} _i - {\varvec{y}} _{j-1} + R{\varvec{g}} _j,\, {\varvec{g}} _i - {\varvec{g}} _j \rangle }\nolimits \ge 0, \quad i<j=1,\ldots ,N, \nonumber \\&\frac{1}{R}\mathop {\langle {\varvec{y}} _{i-1} - R{\varvec{g}} _i - {\varvec{x}} _*,\, {\varvec{g}} _i \rangle }\nolimits \ge 0, \quad i=1,\ldots ,N, \nonumber \\&{\varvec{y}} _{i+1} = {\varvec{y}} _i - R\sum _{k=0}^i h_{i+1,k+1}{\varvec{g}} _{k+1}, \quad i=0,\ldots ,N-2, \nonumber \\&||{\varvec{y}} _0 - {\varvec{x}} _*||^2 \le R^2. \end{aligned}$$
(21)

As in [21, 31, 32, 66, 67], we next introduce the Gram matrix

$$\begin{aligned} {\varvec{Z}} = \left[ \begin{array}{ccccc} ||{\varvec{g}} _1||^2 &{} \mathop {\langle {\varvec{g}} _1,\, {\varvec{g}} _2 \rangle }\nolimits &{} \cdots &{} \mathop {\langle {\varvec{g}} _1,\, {\varvec{g}} _N \rangle }\nolimits &{} \frac{1}{R}\mathop {\langle {\varvec{g}} _1,\, {\varvec{y}} _0 - {\varvec{x}} _* \rangle }\nolimits \\ \mathop {\langle {\varvec{g}} _1,\, {\varvec{g}} _2 \rangle }\nolimits &{} ||{\varvec{g}} _2||^2 &{} \cdots &{} \mathop {\langle {\varvec{g}} _2,\, {\varvec{g}} _N \rangle }\nolimits &{} \frac{1}{R}\mathop {\langle {\varvec{g}} _2,\, {\varvec{y}} _0 - {\varvec{x}} _* \rangle }\nolimits \\ \vdots &{} \vdots &{} \ddots &{} \vdots &{} \vdots \\ \mathop {\langle {\varvec{g}} _1,\, {\varvec{g}} _N \rangle }\nolimits &{} \cdots &{} &{} ||{\varvec{g}} _N||^2 &{} \frac{1}{R}\mathop {\langle {\varvec{g}} _N,\, {\varvec{y}} _0 - {\varvec{x}} _* \rangle }\nolimits \\ \frac{1}{R}\mathop {\langle {\varvec{g}} _1,\, {\varvec{y}} _0 - {\varvec{x}} _* \rangle }\nolimits &{} \cdots &{} &{} \frac{1}{R}\mathop {\langle {\varvec{g}} _N,\, {\varvec{y}} _0 - {\varvec{x}} _* \rangle }\nolimits &{} \frac{1}{R^2}||{\varvec{y}} _0 - {\varvec{x}} _*||^2 \end{array}\right] \end{aligned}$$
(22)

to relax the problem as

$$\begin{aligned} \max _{{\varvec{Z}} \in \mathbb {S} _+^{N+1}}\;&\textsf {tr} \{{\varvec{u}} _N{\varvec{u}} _N^\top {\varvec{Z}} \} \nonumber \\ \text {subject to }\;&\textsf {tr} \{{\varvec{A}} _{i,j}({\varvec{h}}){\varvec{Z}} \} \le 0, \quad i<j=1,\ldots ,N, \nonumber \\&\textsf {tr} \{{\varvec{B}} _i({\varvec{h}}){\varvec{Z}} \} \le 0, \quad i=1,\ldots ,N, \nonumber \\&\textsf {tr} \{{\varvec{C}} {\varvec{Z}} \} \le 1, \end{aligned}$$
(23)

where \(\{{\varvec{u}} _i\}_{i=1}^{N+1}\) is the canonical basis of \(\mathbb {R} ^{N+1}\) and

$$\begin{aligned} {\left\{ \begin{array}{ll} {\varvec{A}} _{i,j}({\varvec{h}}) {:}{=} ({\varvec{u}} _i-{\varvec{u}} _j)\odot ({\varvec{u}} _i - {\varvec{u}} _j) - ({\varvec{u}} _i - {\varvec{u}} _j)\odot \sum _{l=i-1}^{j-2}\sum _{k=0}^lh_{l+1,k+1}{\varvec{u}} _{k+1}, &{} i<j=1,\ldots ,N, \\ {\varvec{B}} _i({\varvec{h}}) {:}{=} {\varvec{u}} _i{\varvec{u}} _i^\top - {\varvec{u}} _i\odot {\varvec{u}} _{N+1} + {\varvec{u}} _i\odot \sum _{l=0}^{i-2}\sum _{k=0}^lh_{l+1,k+1}{\varvec{u}} _{k+1}, &{} i=1,\ldots ,N, \\ {\varvec{C}} {:}{=} {\varvec{u}} _{N+1}{\varvec{u}} _{N+1}^\top \end{array}\right. } \end{aligned}$$

with the outer product operator \({\varvec{u}} \odot {\varvec{v}} {:}{=} \frac{1}{2}({\varvec{u}} {\varvec{v}} ^\top + {\varvec{v}} {\varvec{u}} ^\top )\). If \(\dim {\mathcal {H}}\ge N+1\), the problems (17) and (23) are equivalent, based on the following lemma similar to [61, Lemma 1].

Lemma 3.1

If \(\dim {\mathcal {H}}\ge N+1\), then

$$\begin{aligned}&{\varvec{Z}} \in \mathbb {S} _+^{N+1} \quad \Leftrightarrow \quad \exists \; {\varvec{g}} _1,{\varvec{g}} _2,\ldots ,{\varvec{g}} _N,\frac{1}{R}({\varvec{y}} _0-{\varvec{x}} _*)\in {\mathcal {H}}\\&\quad \text { such that } {\varvec{Z}} \text { is given by the expression in } (22) . \end{aligned}$$

For simplicity in later analysis, we discard some constraints as

$$\begin{aligned} \max _{{\varvec{Z}} \in \mathbb {S} _+^{N+1}}\;&\textsf {tr} \{{\varvec{u}} _N{\varvec{u}} _N^\top {\varvec{Z}} \} \nonumber \\ \text {subject to }\;&\textsf {tr} \{{\varvec{A}} _{i-1,i}({\varvec{h}}){\varvec{Z}} \} \le 0, \quad i=2,\ldots ,N, \nonumber \\&\textsf {tr} \{{\varvec{B}} _N({\varvec{h}}){\varvec{Z}} \} \le 0, \nonumber \\&\textsf {tr} \{{\varvec{C}} {\varvec{Z}} \} \le 1, \end{aligned}$$
(24)

which does not affect the result in the paper, i.e., the optimal values of (23) and (24) are found to be numerically equivalent for the method proposed in this paper. Finally, we construct the associated Lagrangian dual of (24)

$$\begin{aligned} {\mathcal {B}}_{D}({\varvec{h}}) {:}{=} \qquad \min _{a_2,\ldots ,a_N,b_N,c\in \mathbb {R}}\;&c\\ \text {subject to }\;&\sum _{i=2}^Na_i{\varvec{A}} _{i-1,i}({\varvec{h}}) + b_N{\varvec{B}} _N({\varvec{h}}) + c{\varvec{C}}- {\varvec{u}} _N{\varvec{u}} _N^\top \succeq {\varvec{0}}, \nonumber \\&a_2,\ldots ,a_N,b_N,c \ge 0, \end{aligned}$$
(D)

where \(a_2,\ldots ,a_N,b_N,c\) are dual variables associated with the constraints of (24), respectively. Then, for any given \({\varvec{h}} \) of the general proximal point method, one can compute (an upper bound of) its worst-case fixed-point residual by numerically solving (D) using any SDP solver. For some choices of \({\varvec{h}} \), such as that of the proximal point method in [32], it may be possible to solve (D) analytically; [32] did so for the proximal point method, yielding the rate (8). This paper provides another choice of \({\varvec{h}} \) that admits an analytical solution to (D) with an accelerated rate.
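For illustration only, the dual problem (D) can be assembled and solved numerically with, e.g., NumPy and CVXPY; the helper `pep_matrices` below is an illustrative name and builds the data matrices defined after (23):

```python
import numpy as np
import cvxpy as cp

def sym_outer(u, v):
    """Symmetric outer product u ⊙ v = (u v^T + v u^T) / 2."""
    return 0.5 * (np.outer(u, v) + np.outer(v, u))

def pep_matrices(h, N):
    """Build A_{i,j}(h), B_i(h), C and the canonical basis u_i used in (23)-(24).
    h is an (N-1) x (N-1) array with h[i-1, k-1] = h_{i,k} (zero where undefined)."""
    I = np.eye(N + 1)
    u = lambda i: I[:, i - 1]                            # u_i, 1-indexed

    def A(i, j):                                         # A_{i,j}(h), i < j
        s = sum((h[l, k] * u(k + 1) for l in range(i - 1, j - 1)
                 for k in range(l + 1)), np.zeros(N + 1))
        return sym_outer(u(i) - u(j), u(i) - u(j)) - sym_outer(u(i) - u(j), s)

    def B(i):                                            # B_i(h)
        s = sum((h[l, k] * u(k + 1) for l in range(i - 1)
                 for k in range(l + 1)), np.zeros(N + 1))
        return np.outer(u(i), u(i)) - sym_outer(u(i), u(N + 1)) + sym_outer(u(i), s)

    C = np.outer(u(N + 1), u(N + 1))
    return A, B, C, u

def dual_bound(h, N):
    """Numerically solve the dual SDP (D) for given step coefficients h."""
    A, B, C, u = pep_matrices(h, N)
    a = cp.Variable(N - 1, nonneg=True)                  # a_2, ..., a_N
    bN = cp.Variable(nonneg=True)
    c = cp.Variable(nonneg=True)
    S = sum(a[i - 2] * A(i - 1, i) for i in range(2, N + 1)) \
        + bN * B(N) + c * C - np.outer(u(N), u(N))
    cp.Problem(cp.Minimize(c), [S >> 0]).solve()
    return c.value
```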

4 Accelerating the proximal point method for maximally monotone operators

Using the dual problem (D), this section develops an accelerated version of the proximal point method via PEP:

$$\begin{aligned} \min _{{\varvec{h}}} {\mathcal {B}}_{D}({\varvec{h}}) , \end{aligned}$$
(HD)

which is studied in [20, 21, 22, 37, 38, 39, 40] for certain classes of problems and methods. The problem is non-convex, but it is convex in the variables \((a_2,\ldots ,a_N,b_N,c)\) for a given \({\varvec{h}} \), and convex in the variables \((c,{\varvec{h}})\) for given \((a_2,\ldots ,a_N,b_N)\). Therefore, we used a variant of alternating minimization that alternately optimizes over \((a_2,\ldots ,a_N,b_N,c)\) given \({\varvec{h}} \) and over \((c,{\varvec{h}})\) given \((a_2,\ldots ,a_N,b_N)\) to find a minimizer using an SDP solver [16, 30]. Inspired by the numerical results, the following lemma specifies a feasible point of (HD) analytically. We do not have a guarantee that such a point is a (unique) minimizer of (HD).

Lemma 4.1

The following

$$\begin{aligned} h_{i,k}&= {\left\{ \begin{array}{ll} -\frac{2k}{i(i+1)}, &{} i=1,\ldots ,N-1,\;k=1,\ldots ,i-1, \\ \frac{2i}{i+1}, &{} i=1,\ldots ,N-1,\;k=i, \end{array}\right. } \end{aligned}$$
(25)
$$\begin{aligned} a_i&= \frac{2(i-1)i}{N^2}, \quad i=2,\ldots ,N, \quad b_N = \frac{2}{N}, \quad c = \frac{1}{N^2} \end{aligned}$$
(26)

is a feasible point of (D) and (HD).

Proof

It is obvious that \(a_2,\ldots ,a_N,b_N,c\) are nonnegative, so we are only left to show the positive semidefinite condition in (D). Since

$$\begin{aligned}&\; \sum _{i=2}^Na_i{\varvec{A}} _{i-1,i}({\varvec{h}}) + b_N{\varvec{B}} _N({\varvec{h}}) + c{\varvec{C}}- {\varvec{u}} _N{\varvec{u}} _N^\top \\&\quad = \; \sum _{i=2}^N\frac{2(i-1)i}{N^2}\left[ ({\varvec{u}} _{i-1} - {\varvec{u}} _i)\odot ({\varvec{u}} _{i-1} - {\varvec{u}} _i)\right. \\&\left. \qquad - ({\varvec{u}} _{i-1} - {\varvec{u}} _i)\odot \left( \frac{2(i-1)}{i}{\varvec{u}} _{i-1} - \sum _{k=0}^{i-3}\frac{2(k+1)}{(i-1)i}{\varvec{u}} _{k+1}\right) \right] \\&\qquad + \frac{2}{N}\left[ {\varvec{u}} _N{\varvec{u}} _N^\top - {\varvec{u}} _N\odot {\varvec{u}} _{N+1} + {\varvec{u}} _N\odot \sum _{l=0}^{N-2}\right. \\&\qquad \left. \left( \frac{2(l+1)}{l+2}{\varvec{u}} _{l+1}- \sum _{k=0}^{l-1}\frac{2(k+1)}{(l+1)(l+2)}{\varvec{u}} _{k+1}\right) \right] \\&\qquad + \frac{1}{N^2}{\varvec{u}} _{N+1}{\varvec{u}} _{N+1}^\top - {\varvec{u}} _N{\varvec{u}} _N^\top \\&\quad = \; \sum _{i=2}^{N-1} \left[ \frac{2(i-1)i}{N^2} + \frac{2i(i+1)}{N^2}\left( 1 - \frac{2i}{i+1}\right) \right] {\varvec{u}} _i{\varvec{u}} _i^\top + \left[ \frac{2(N-1)N}{N^2} \right. \\&\left. \qquad + \frac{2}{N} - 1\right] {\varvec{u}} _N{\varvec{u}} _N^\top + \frac{1}{N^2}{\varvec{u}} _{N+1}{\varvec{u}} _{N+1}^\top \\&\qquad + \sum _{i=2}^{N-1}\left[ \frac{2(i-1)i}{N^2}\left( -2 + \frac{2(i-1)}{i}\right) + \frac{2i(i+1)}{N^2}\frac{2(i-1)}{i(i+1)}\right] {\varvec{u}} _{i-1}\odot {\varvec{u}} _i \\&\qquad + \left[ \frac{2(N-1)}{N}\left( -2 + \frac{2(N-1)}{N}\right) + \frac{2}{N}\frac{2(N-1)}{N}\right] {\varvec{u}} _{N-1}\odot {\varvec{u}} _N - \frac{2}{N}{\varvec{u}} _N\odot {\varvec{u}} _{N+1} \\&\qquad + \sum _{i=3}^{N-1}\sum _{k=0}^{i-3}\left[ -\frac{2(i-1)i}{N^2}\frac{2(k+1)}{(i-1)i} + \frac{2i(i+1)}{N^2}\frac{2(k+1)}{i(i+1)}\right] {\varvec{u}} _{k+1}\odot {\varvec{u}} _i \\&\qquad + \sum _{k=0}^{N-3}\left[ -\frac{2(N-1)N}{N^2}\frac{2(k+1)}{(N-1)N} \right. \\&\left. \qquad + \frac{2}{N}\left( \frac{2(k+1)}{k+2} - \sum _{l=k+1}^{N-2}\frac{2(k+1)}{(l+1)(l+2)}\right) \right] {\varvec{u}} _{k+1}\odot {\varvec{u}} _N \\&\quad = \; {\varvec{u}} _N{\varvec{u}} _N^\top + \frac{1}{N^2}{\varvec{u}} _{N+1}{\varvec{u}} _{N+1}^\top - \frac{2}{N}{\varvec{u}} _N\odot {\varvec{u}} _{N+1} \\&\quad =\; \left( {\varvec{u}} _N - \frac{1}{N}{\varvec{u}} _{N+1}\right) \left( {\varvec{u}} _N - \frac{1}{N}{\varvec{u}} _{N+1}\right) ^\top \succeq {\varvec{0}}, \end{aligned}$$

the given point is a feasible point of (HD). \(\square \)
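As a quick numerical sanity check (ours, not the paper's), the certificate of Lemma 4.1 can be assembled with the `pep_matrices` helper sketched at the end of Sect. 3 and compared against the rank-one factorization obtained in the proof:

```python
import numpy as np

N = 10
h = np.zeros((N - 1, N - 1))                 # h[i-1, k-1] = h_{i,k} from (25)
for i in range(1, N):
    for k in range(1, i):
        h[i - 1, k - 1] = -2 * k / (i * (i + 1))
    h[i - 1, i - 1] = 2 * i / (i + 1)

A, B, C, u = pep_matrices(h, N)              # helper from the Sect. 3 sketch
a = {i: 2 * (i - 1) * i / N**2 for i in range(2, N + 1)}          # (26)
bN, c = 2 / N, 1 / N**2
S = sum(a[i] * A(i - 1, i) for i in range(2, N + 1)) + bN * B(N) + c * C \
    - np.outer(u(N), u(N))
rank_one = np.outer(u(N) - u(N + 1) / N, u(N) - u(N + 1) / N)     # factor in the proof
print(np.allclose(S, rank_one), np.linalg.eigvalsh(S).min() >= -1e-12)
```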

Before providing the worst-case rate of the general proximal point method with \({\varvec{h}} \) in (25), we develop its efficient formulation below. This has a low computational cost per iteration, comparable to that of the proximal point method. Note that this may not be the only efficient form for \({\varvec{h}} \) in (25).

[Algorithm: Proposed accelerated proximal point method]
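As a minimal sketch (not part of the paper), the proposed accelerated proximal point method can be written as follows, using the efficient recursion established in Proposition 4.1 below:

```python
import numpy as np

def accelerated_ppm(resolvent, x0, num_iters):
    """Sketch of the proposed accelerated proximal point method:
        x_{i+1} = J_{lam M}(y_i)
        y_{i+1} = x_{i+1} + i/(i+2) (x_{i+1} - x_i) - i/(i+2) (x_i - y_{i-1}),
    with x_0 = y_0 = y_{-1}; `resolvent` evaluates J_{lam M}."""
    x_prev = np.asarray(x0, dtype=float)   # x_0
    y_prev = x_prev                        # y_{-1}
    y = x_prev                             # y_0
    residuals = []
    for i in range(num_iters):
        x = resolvent(y)                                   # x_{i+1} = J_{lam M}(y_i)
        residuals.append(np.linalg.norm(x - y) ** 2)       # ||x_{i+1} - y_i||^2, cf. (27)
        coef = i / (i + 2)
        y_next = x + coef * (x - x_prev) - coef * (x_prev - y_prev)   # inertia + correction
        x_prev, y_prev, y = x, y, y_next
    return x, residuals
```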

Proposition 4.1

The sequences \(\{{\varvec{x}} _i\}\) and \(\{{\varvec{y}} _i\}\) generated by the general proximal point method with step coefficients \(\{h_{i,k}\}\) in (25) are identical to the corresponding sequence generated by the proposed accelerated proximal point method starting from the same initial point.

Proof

We use induction, and for clarity we use the notation \({\varvec{x}} _1',{\varvec{x}} _2',\ldots \) and \({\varvec{y}} _0',{\varvec{y}} _1',\ldots \) for the general proximal point method with (25). It is obvious that \({\varvec{x}} _0={\varvec{y}} _0'={\varvec{y}} _0\), \({\varvec{x}} _1' = {\varvec{x}} _1 = {\varvec{y}} _1\), and we have

$$\begin{aligned} {\varvec{y}} _1'&= {\varvec{y}} _0' + h_{1,1}({\varvec{x}} _1' - {\varvec{y}} _0') = {\varvec{x}} _1' = {\varvec{y}} _1 . \end{aligned}$$

Similarly, it is obvious that \({\varvec{x}} _2' = {\varvec{x}} _2\), and we have

$$\begin{aligned} {\varvec{y}} _2'&= {\varvec{y}} _1' + \sum _{k=0}^1h_{2,k+1}({\varvec{x}} _{k+1}' - {\varvec{y}} _k') = {\varvec{y}} _1 + \frac{4}{3}({\varvec{x}} _2 - {\varvec{y}} _1) - \frac{1}{3}({\varvec{x}} _1 - {\varvec{y}} _0) \\&= {\varvec{x}} _2 + \frac{1}{3}({\varvec{x}} _2 - {\varvec{x}} _1) - \frac{1}{3}({\varvec{x}} _1 - {\varvec{y}} _0) = {\varvec{y}} _2. \end{aligned}$$

It is then also obvious that \({\varvec{x}} _3' = {\varvec{x}} _3\). Assuming \({\varvec{x}} _l' = {\varvec{x}} _l\) for \(l=1,\ldots ,i+1\) and \({\varvec{y}} _l' = {\varvec{y}} _l\) for \(l=0,\ldots ,i\), for some \(i\ge 2\), we have

$$\begin{aligned} {\varvec{y}} _{i+1}'&= {\varvec{y}} _i' + \sum _{k=0}^ih_{i+1,k+1}({\varvec{x}} _{k+1}' - {\varvec{y}} _k') \\&= {\varvec{y}} _i + \frac{2(i+1)}{i+2}({\varvec{x}} _{i+1} - {\varvec{y}} _i) + \sum _{k=0}^{i-1}\left( -\frac{2(k+1)}{(i+1)(i+2)}\right) ({\varvec{x}} _{k+1} - {\varvec{y}} _k) \\&= {\varvec{y}} _i + \left( 1+\frac{i}{i+2}\right) ({\varvec{x}} _{i+1} - {\varvec{y}} _i) + \frac{i}{i+2}\sum _{k=0}^{i-1}\left( -\frac{2(k+1)}{i(i+1)}\right) ({\varvec{x}} _{k+1} - {\varvec{y}} _k) \\&= {\varvec{x}} _{i+1} + \frac{i}{i+2}({\varvec{x}} _{i+1} - {\varvec{y}} _i) + \frac{i}{i+2}({\varvec{y}} _i + {\varvec{y}} _{i-1} - 2{\varvec{x}} _i) \\&= {\varvec{x}} _{i+1} + \frac{i}{i+2}({\varvec{x}} _{i+1} - {\varvec{x}} _i) - \frac{i}{i+2}({\varvec{x}} _i - {\varvec{y}} _{i-1}) = {\varvec{y}} _{i+1}, \end{aligned}$$

where the fourth equality uses

$$\begin{aligned} {\varvec{y}} _i&= {\varvec{y}} _{i-1} + \frac{2i}{i+1}({\varvec{x}} _i - {\varvec{y}} _{i-1}) + \sum _{k=0}^{i-2}\left( -\frac{2(k+1)}{i(i+1)}\right) ({\varvec{x}} _{k+1} - {\varvec{y}} _k) \\&= {\varvec{y}} _{i-1} + 2({\varvec{x}} _i - {\varvec{y}} _{i-1}) + \sum _{k=0}^{i-1}\left( -\frac{2(k+1)}{i(i+1)}\right) ({\varvec{x}} _{k+1} - {\varvec{y}} _k) .\end{aligned}$$

\(\square \)

The proposed accelerated method has the inertia term \(\frac{i}{i+2}({\varvec{x}} _{i+1} - {\varvec{x}} _i)\), similar to Nesterov’s acceleration [51, 52] and Güler’s methods [34]. However, the proposed method also has a correction term \(- \frac{i}{i+2}({\varvec{x}} _i - {\varvec{y}} _{i-1})\), which is essential to guarantee an accelerated rate. Without such a correction term, the accelerated method can diverge, for which we provide an example at the end of this section. We leave a further understanding of the role of the proposed correction term as future work, possibly via a differential equation perspective as in [64] for Nesterov’s acceleration. Note that a different correction term for Nesterov’s acceleration has been studied via differential equation analysis for convex minimization [4, 63].

The following theorem provides an accelerated rate of the proposed method in terms of the fixed-point residual.

Theorem 4.1

Let \({\varvec{M}} \in {\mathcal {M}}({\mathcal {H}})\) and let \({\varvec{x}} _0,{\varvec{y}} _0,{\varvec{x}} _1,{\varvec{y}} _1,\ldots \in {\mathcal {H}}\) be generated by the proposed accelerated proximal point method. Assume that \(||{\varvec{x}} _0 - {\varvec{x}} _*||\le R\) for a constant \(R>0\) and for some \({\varvec{x}} _*\in X_*({\varvec{M}})\). Then for any \(i\ge 1\),

$$\begin{aligned} ||{\varvec{x}} _i - {\varvec{y}} _{i-1}||^2 \le \frac{R^2}{i^2} .\end{aligned}$$
(27)

Proof

Using Lemma 4.1, the general proximal point method with \({\varvec{h}} \) (25) satisfies

$$\begin{aligned} \frac{1}{R^2}||{\varvec{x}} _N - {\varvec{y}} _{N-1}||^2 \le {\mathcal {B}}_{D}({\varvec{h}}) \le \frac{1}{N^2}. \end{aligned}$$
(28)

Since the iterates of the method are recursive and do not depend on a given N, the bound (28) generalizes to the intermediate iterates of the method. By Proposition 4.1, the proposed accelerated proximal point method also satisfies the bound (28), which concludes the proof. \(\square \)

The bound (8) of the proximal point method was found to be exact in [32] by specifying a certain operator \({\varvec{M}} \) achieving the bound (8) exactly; that is, for given \(N\ge 2\), the proximal point method exactly achieves the bound (8) for the operator

$$\begin{aligned} {\varvec{M}} \left[ \begin{array}{c} u \\ v \end{array}\right] = \frac{1}{\lambda \sqrt{N-1}}\left[ \begin{array}{cc} 0\; &{} \quad 1 \\ -1\; &{} \quad 0 \end{array}\right] \left[ \begin{array}{c} u \\ v \end{array}\right] , \end{aligned}$$
(29)

with an initial point \({\varvec{x}} _0 = [1\; 0]^\top \). Such an exact analysis is important since it reveals the worst-case behavior of the iterates of the method. However, we were not able to show that the bound (27) of the proposed method is exact, which we leave as future work. Instead, we compared the behavior of the iterates of the proximal point method and its accelerated variants on the operator \({\varvec{M}} \) in (29). Figure 1 compares the proximal point method, Güler’s first accelerated method with \({\varvec{M}} \) instead of \(\partial f\) (i.e., an instance of the inertia method), and the proposed accelerated method, with an initial point \({\varvec{x}} _0 = [1\; 0]^\top \) and the optimal point \({\varvec{x}} _* = {\varvec{0}} \). Note that Güler’s first method is almost equivalent to the proposed accelerated method without the correction term \(-\frac{i}{i+2}({\varvec{x}} _i - {\varvec{y}} _{i-1})\), and it exhibits diverging behavior in Fig. 1. The figure illustrates that the correction term greatly helps the iterates to converge rapidly by reducing the radius of the orbit of the iterates, compared to the other methods.

Fig. 1

Solving a worst-case monotone inclusion problem of the proximal point method with \({\varvec{M}} \) (29) with \(N=100\); (left) the fixed-point residual versus iteration, (right) the trajectory of the iterates \({\varvec{x}} _i = [x_{i,1},\;x_{i,2}]^\top \) (markers are displayed every 5th iteration)
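For illustration (not from the paper), running the `accelerated_ppm` sketch given earlier in this section on the operator (29) numerically confirms the bound of Theorem 4.1:

```python
import numpy as np

N, lam = 100, 1.0
M = (1.0 / (lam * np.sqrt(N - 1))) * np.array([[0.0, 1.0], [-1.0, 0.0]])  # operator (29)
resolvent = lambda y: np.linalg.solve(np.eye(2) + lam * M, y)             # J_{lam M}
x, res = accelerated_ppm(resolvent, [1.0, 0.0], N)    # x_0 = [1, 0]^T, x_* = 0, so R = 1
print(res[-1], res[-1] <= 1.0 / N**2)                 # Theorem 4.1: <= R^2 / N^2
```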

We further investigate the behavior of the proposed method for a convex–concave saddle-point problem

$$\begin{aligned} \min _{{\varvec{u}} \in {\mathcal {H}}_1}\max _{{\varvec{v}} \in {\mathcal {H}}_2}\;&\phi ({\varvec{u}},{\varvec{v}}) , \end{aligned}$$
(30)

where \({\mathcal {H}}_1\) and \({\mathcal {H}}_2\) denote real Hilbert spaces equipped with inner product \(\mathop {\langle \cdot ,\, \cdot \rangle }\nolimits \), and \(\phi (\cdot ,{\varvec{v}})\in {\mathcal {F}}({\mathcal {H}}_1)\), \(-\phi ({\varvec{u}},\cdot )\in {\mathcal {F}}({\mathcal {H}}_2)\), which we further study in Sects. 5 and 6.1. The saddle subdifferential of \(\phi \),

$$\begin{aligned} \left[ \begin{array}{c} \partial _{{\varvec{u}}} \phi ({\varvec{u}},{\varvec{v}}) \\ \partial _{{\varvec{v}}} (-\phi ({\varvec{u}},{\varvec{v}})) \end{array}\right] , \end{aligned}$$
(31)

is monotone [57]. The proposed accelerated method applied to (31) with \({\varvec{x}} _i {:}{=} ({\varvec{u}} _i,{\varvec{v}} _i)\) and \({\varvec{x}} _* {:}{=} ({\varvec{u}} _*,{\varvec{v}} _*)\) (see Sect. 6.1 for details) satisfies

$$\begin{aligned} \phi ({\varvec{u}} _i,{\varvec{v}} _*) - \phi ({\varvec{u}} _*,{\varvec{v}} _i) \le \frac{||{\varvec{u}} _0 - {\varvec{u}} _*||^2 + ||{\varvec{v}} _0 - {\varvec{v}} _*||^2}{4\lambda i} \end{aligned}$$
(32)

for any \(i\ge 1\). This is numerically conjectured by the PEP analysis in (19) with the objective function \(\frac{1}{R}||{\varvec{x}} _N - {\varvec{y}} _{N-1}||^2\) and the inequality \(\mathop {\langle {\varvec{x}} _N - {\varvec{x}} _*,\, {\varvec{q}} _N \rangle }\nolimits \ge 0\) in (19) replaced by \(\phi ({\varvec{u}} _N,{\varvec{v}} _*) - \phi ({\varvec{u}} _*,{\varvec{v}} _N)\) and \(\mathop {\langle {\varvec{x}} _N - {\varvec{x}} _*,\, {\varvec{q}} _N \rangle }\nolimits \ge \phi ({\varvec{u}} _N,{\varvec{v}} _*) - \phi ({\varvec{u}} _*,{\varvec{v}} _N)\), respectively.Footnote 2

5 Restarting the accelerated proximal point method for strongly monotone operators

For strongly monotone operators, the proximal point method has a linear rate (9), whereas the proposed accelerated method is not guaranteed to have such a fast rate. Technically, one should be able to find an accelerated method for strongly monotone operators via PEP, as we did for monotone operators in the previous section. However, the resulting PEP problem, reminiscent of (HD), is much more difficult to solve, and we leave it as future work. Instead, we consider a fixed restarting technique in [50, Section 11.4], [53, Section 5.1] that restarts an accelerated method with a sublinear rate every certain number of iterations to yield a fast linear rate, particularly for \({\varvec{M}} \in {\mathcal {M}}_{\mu }({\mathcal {H}})\) in this section.

Suppose one restarts the proposed method every k (inner) iterations by initializing the \((j+1)\)th outer iteration as \({\varvec{x}} _{j+1,0} = {\varvec{y}} _{j+1,0} = {\varvec{y}} _{j+1,-1} = {\varvec{x}} _{j,k}\), where \({\varvec{x}} _{j,l}\) and \({\varvec{y}} _{j,l}\) denote the iterates at the jth outer iteration and lth inner iteration for \(j=0,1,\ldots \) and \(l=-1,0,1,\ldots ,k\). Using the rate (27) (with \(R = ||{\varvec{x}} _{j,0} - {\varvec{x}} _*||\)) and the strong monotonicity condition (2), we have

$$\begin{aligned} ||{\varvec{x}} _{j,k} - {\varvec{y}} _{j,k-1}||^2 \le \frac{||{\varvec{x}} _{j,0} - {\varvec{x}} _*||^2}{k^2} \le \frac{1}{\mu ^2 k^2}||{\varvec{M}} {\varvec{x}} _{j,0}||^2 \end{aligned}$$
(33)

for \(j = 0,1,\ldots \). Since \(\frac{1}{\lambda }({\varvec{x}} _{j-1,k} - {\varvec{y}} _{j-1,k-1}) \in {\varvec{M}} {\varvec{x}} _{j,0}\), we have a linear rate

$$\begin{aligned} ||{\varvec{x}} _{j,k} - {\varvec{y}} _{j,k-1}||^2 \le \frac{1}{\lambda ^2\mu ^2 k^2}||{\varvec{x}} _{j-1,k} - {\varvec{y}} _{j-1,k-1}||^2 . \end{aligned}$$
(34)

For a given \(N=jk\) total number of steps, minimizing the overall rate with respect to k yields an optimal choice of the restarting interval given by \(k_{\mathrm {opt}}{\,\approx \,} \frac{e}{\lambda \mu }\), where e is Euler’s number. The corresponding linear rate is \(O((e^{\lambda \mu /e})^{-2N})\).
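A minimal sketch of this fixed restarting scheme (not from the paper), reusing the `accelerated_ppm` sketch from Sect. 4:

```python
import numpy as np

def restarted_accelerated_ppm(resolvent, x0, k, num_restarts):
    """Restart the accelerated method every k inner iterations, re-initializing
    each outer iteration from the last inner iterate; for M in M_mu(H) the
    analysis above suggests choosing k near e / (lam * mu)."""
    x = np.asarray(x0, dtype=float)
    residuals = []
    for _ in range(num_restarts):
        x, res = accelerated_ppm(resolvent, x, k)   # sketch from Sect. 4
        residuals.extend(res)
    return x, residuals
```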

We further investigate the behavior of the restarting technique for the saddle-point problem (30) under the assumption that \(\phi \) is strongly-convex–strongly-concave, i.e., \(\phi (\cdot ,{\varvec{v}}) - \frac{\mu }{2}||\cdot ||^2\in {\mathcal {F}}({\mathcal {H}}_1)\) and \(-\phi ({\varvec{u}},\cdot ) - \frac{\mu }{2}||\cdot ||^2\in {\mathcal {F}}({\mathcal {H}}_2)\). The associated saddle subdifferential (31) is then \(\mu \)-strongly monotone. In this case, using the rate (32) and the inequalities \(\phi ({\varvec{u}} _*,{\varvec{v}} _*) + \frac{\mu }{2}||{\varvec{u}}- {\varvec{u}} _*||^2 \le \phi ({\varvec{u}},{\varvec{v}} _*)\) and \(-\phi ({\varvec{u}} _*,{\varvec{v}} _*) + \frac{\mu }{2}||{\varvec{v}}- {\varvec{v}} _*||^2 \le -\phi ({\varvec{u}} _*,{\varvec{v}})\), the proposed method with restarting every k iterations satisfies

$$\begin{aligned} \phi ({\varvec{u}} _{j,k},{\varvec{v}} _*) - \phi ({\varvec{u}} _*,{\varvec{v}} _{j,k}) \le \frac{1}{2\lambda \mu k}(\phi ({\varvec{u}} _{j,0},{\varvec{v}} _*) - \phi ({\varvec{u}} _*,{\varvec{v}} _{j,0})) \end{aligned}$$
(35)

for \(j=0,1,\ldots \). The associated optimal restarting interval is \(k_{\mathrm {opt}}^\phi {\,\approx \,} \frac{e}{2\lambda \mu }\), which is half of \(k_{\mathrm {opt}}\). The corresponding linear rate is also \(O((e^{\lambda \mu /e})^{-2N})\), whereas the proximal point method has the rate \(O((1+\lambda \mu )^{-2N})\) in (9). For any given positive \(\mu \), there is no positive \(\lambda \) that satisfies both \(k_{\mathrm {opt}}^\phi (\lambda ,\mu )\ge 1\) and \(e^{\lambda \mu /e} > 1+\lambda \mu \). This contrasts with the fact that the worst-case rate of optimally restarting the proposed method is not slower than that of the proximal point method, which implies that the bounds (34) and (35) are not exact; we leave finding their tight bounds as future work. The numerical experiment below (and those in Sect. 6) suggests that the restarting technique can perform better than the proximal point method.

We consider a toy problem that is a combination of the worst-case problems in \({\mathcal {M}}({\mathcal {H}})\) and \({\mathcal {M}}_{\mu }({\mathcal {H}})\) for the proximal point method:

$$\begin{aligned} {\varvec{M}} \left[ \begin{array}{c} u \\ v \end{array}\right] = \left( \frac{1}{\lambda \sqrt{N-1}}\left[ \begin{array}{cc} 0\; &{} \quad 1 \\ -1\; &{} \quad 0 \end{array}\right] + \left[ \begin{array}{cc} \mu \; &{} \quad 0 \\ 0\; &{} \quad \mu \end{array}\right] \right) \left[ \begin{array}{c} u \\ v \end{array}\right] , \end{aligned}$$
(36)

which is the saddle subdifferential operator of \(\phi (u,v) = \frac{\mu }{2}u^2 + \frac{1}{\lambda \sqrt{N-1}}uv - \frac{\mu }{2}v^2\). We choose \(N = 100\), \(\lambda = 1\) and \(\mu = 0.02\). The optimal restarting intervals are \(k_{\mathrm {opt}} \approx 136\) and \(k_{\mathrm {opt}}^\phi \approx 68\), and we run 200 iterations in the experiment, where restarting intervals 17, 34, 68, and 136 are considered. Figure 2 compares the proximal point method, its accelerated variants, and the proposed accelerated method with restarting, with an initial point \({\varvec{x}} _0 = [1\; 0]^\top \) and the optimal point \({\varvec{x}} _* = {\varvec{0}} \). Figure 2 shows that the proximal point method has a linear rate that is faster than the proposed method (which has a sublinear rate), while restarting greatly accelerates the proposed method to a fast linear rate. Figure 2 also illustrates that the optimal restarting intervals \(k_{\mathrm {opt}}\) and \(k_{\mathrm {opt}}^\phi \) for strongly monotone operators and strongly-convex–strongly-concave functions, respectively, are not optimal for this specific case. Examples in the next section also show that restarting can be useful even without strong monotonicity (but possibly with local strong monotonicity).

Fig. 2

Solving a strongly monotone inclusion problem with \({\varvec{M}} \) (36); (left) the fixed-point residual versus iteration, (right) the function residual versus iteration

6 Applications of the accelerated proximal point method

As mentioned earlier, the proximal point method for maximally monotone operators includes various well-known convex optimization methods, such as the augmented Lagrangian method (i.e., the method of multipliers), the proximal method of multipliers, and ADMM. The augmented Lagrangian method is equivalent to the proximal point method directly solving the dual convex minimization problem [58], so Güler’s methods [34] already provide an acceleration, whereas other instances of the proximal point method have had no known accelerations. Thus, this section introduces accelerations of well-known instances of the proximal point method, which, to the best of our knowledge, were not possible previously (under this paper’s setting).

6.1 Accelerating the proximal point method for convex–concave saddle-point problem

This section considers a convex–concave saddle-point problem (30), where the associated saddle subdifferential operator (31) is monotone. [59] applied the proximal point method to such an operator to solve the convex–concave saddle-point problem, and this section further applies the proposed acceleration to this proximal point method as below.

[Algorithm: Accelerated proximal point method for the saddle-point problem (30)]

One primary use of this accelerated method is the following convex–concave Lagrangian problem

$$\begin{aligned} \min _{{\varvec{u}} \in {\mathcal {H}}_1}\max _{{\varvec{v}} \in {\mathcal {H}}_2}\;&\left\{ L({\varvec{u}},{\varvec{v}}) {:}{=} f({\varvec{u}}) + \mathop {\langle {\varvec{v}},\, {\varvec{A}} {\varvec{u}}- {\varvec{b}} \rangle }\nolimits \right\} , \end{aligned}$$
(37)

associated with the linearly constrained problem

$$\begin{aligned} \min _{{\varvec{u}} \in {\mathcal {H}}_1}\;&f({\varvec{u}}) \nonumber \\ \text {subject to }\;&{\varvec{A}} {\varvec{u}} = {\varvec{b}}, \end{aligned}$$
(38)

where \({\varvec{A}} \in {\mathcal {B}}({\mathcal {H}}_1,{\mathcal {H}}_2)\) and \({\varvec{b}} \in {\mathcal {H}}_2\). The resulting method is called the proximal method of multipliers in [58], and applying the proposed acceleration to this method leads to the following.

[Algorithm: Accelerated proximal method of multipliers]

Note that this method without the acceleration and without the term \(\frac{1}{2\lambda }||{\varvec{u}}- {\varvec{u}} _i||^2\) reduces to the augmented Lagrangian method. This method has an advantage over the augmented Lagrangian method and its accelerated variants: the primal iterate \({\varvec{u}} _{i+1}\) is uniquely defined with better conditioning.
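As a hedged illustration (not the paper's algorithm box), the accelerated proximal method of multipliers can be sketched as follows; the subproblem solver `u_step` and all names are ours, and the update form is obtained by evaluating the resolvent of the saddle subdifferential of (37), so it may differ in minor details from the method above:

```python
import numpy as np

def accel_prox_method_of_multipliers(u_step, A, b, lam, u0, v0, num_iters):
    """Hedged sketch for (37). `u_step(yu, yv)` is assumed to return
        argmin_u { f(u) + <yv, A u - b> + (lam/2)||A u - b||^2 + 1/(2 lam) ||u - yu||^2 },
    the primal part of the resolvent of the saddle subdifferential of L(u, v);
    the dual part is then v = yv + lam (A u - b)."""
    u_prev, v_prev = np.asarray(u0, float), np.asarray(v0, float)
    yu_prev, yv_prev = u_prev, v_prev          # y_{-1}
    yu, yv = u_prev, v_prev                    # y_0
    for i in range(num_iters):
        u = u_step(yu, yv)                     # primal resolvent step
        v = yv + lam * (A @ u - b)             # dual resolvent step
        coef = i / (i + 2)
        yu_next = u + coef * (u - u_prev) - coef * (u_prev - yu_prev)
        yv_next = v + coef * (v - v_prev) - coef * (v_prev - yv_prev)
        u_prev, v_prev = u, v
        yu_prev, yv_prev = yu, yv
        yu, yv = yu_next, yv_next
    return u, v
```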

Example 6.1

We apply the accelerated proximal method of multipliers to a basis pursuit problem

$$\begin{aligned} \min _{{\varvec{u}} \in \mathbb {R} ^{d_1}}\;&||{\varvec{u}} ||_1 \nonumber \\ \text {subject to }\;&{\varvec{A}} {\varvec{u}} = {\varvec{b}}, \end{aligned}$$
(39)

where \({\varvec{A}} \in \mathbb {R} ^{d_2\times d_1}\) and \({\varvec{b}} \in \mathbb {R} ^{d_2}\). In the experiment, we choose \(d_1 = 100\), \(d_2 = 20\), and a randomly generated \({\varvec{A}} \). A true sparse \({\varvec{u}} _{\mathrm {true}}\) is randomly generated, followed by a thresholding to sparsify its entries, and \({\varvec{b}} \) is then given by \({\varvec{A}} {\varvec{u}} _{\mathrm {true}}\). We run 100 iterations of the proximal method of multipliers and its variants with \(\lambda = 0.01\) and initial \({\varvec{x}} _0 = {\varvec{0}} \). Since the \({\varvec{u}} _{i+1}\)-update does not have a closed form, we solved it using a sufficient number of iterations of the strongly convex version of FISTA [8] in [12, Theorem 4.10].

Figure 3 compares the proximal method of multipliers and its accelerated variants. Similar to Fig. 1, Güler’s first accelerated version diverges, while the proposed method accelerates compared to the non-accelerated version. The proposed method exhibits an oscillation in Fig. 3 (and a subtle oscillation in Fig. 1). This might be due to high momentum from the acceleration, as discussed in [54]. So in Fig. 3 we heuristically restarted the method every 30 iterations to avoid such oscillation and accelerate further, as suggested in [54]. Developing an approach to appropriately choose a restarting interval, or to adaptively restart the method as in [54] for such problems, is left as future work.

Fig. 3

Solving a basis pursuit problem (39); the fixed-point residual versus iteration

6.2 Accelerating the primal–dual hybrid gradient method

This section considers a linearly coupled convex–concave saddle-point problem

$$\begin{aligned} \min _{{\varvec{u}} \in {\mathcal {H}}_1}\max _{{\varvec{v}} \in {\mathcal {H}}_2} \left\{ \phi ({\varvec{u}},{\varvec{v}}) \equiv f({\varvec{u}}) + \mathop {\langle {\varvec{K}} {\varvec{u}},\, {\varvec{v}} \rangle }\nolimits - g({\varvec{v}})\right\} , \end{aligned}$$
(40)

where \(f\in {\mathcal {F}}({\mathcal {H}}_1)\), \(g\in {\mathcal {F}}({\mathcal {H}}_2)\) and \({\varvec{K}} \in {\mathcal {B}}({\mathcal {H}}_1,{\mathcal {H}}_2)\). One widely known method for such a problem is the primal–dual hybrid gradient (PDHG) method [11, 25], which is a preconditioned proximal point method (with \(\lambda = 1\)) for the saddle subdifferential operator (31) of \(\phi \) [13, 35]. The associated preconditioner is

$$\begin{aligned} {\varvec{P}} = \left[ \begin{array}{cc} \frac{1}{\tau }{\varvec{I}} \;&{}\quad -{\varvec{K}} ^* \\ -{\varvec{K}} \;&{}\quad \frac{1}{\sigma }{\varvec{I}} \end{array}\right] ,\end{aligned}$$
(41)

which is positive definite when \(\tau \sigma ||{\varvec{K}} ||^2<1\), where \(||{\varvec{K}} || = \sup _{||{\varvec{x}} ||\le 1} ||{\varvec{K}} {\varvec{x}} ||\). As mentioned in Remark 2.1, we can directly apply our results to the PDHG method as below.

[Algorithm: Accelerated PDHG method]

Corollary 6.1

Assume that \(\mathop {\langle {\varvec{P}} ({\varvec{x}} _0 - {\varvec{x}} _*),\, {\varvec{x}} _0 - {\varvec{x}} _* \rangle }\nolimits \le R^2\) for some \({\varvec{x}} _* \in X_*({\varvec{M}})\). The PDHG method satisfies

$$\begin{aligned} \mathop {\langle {\varvec{P}} ({\varvec{x}} _i - {\varvec{x}} _{i-1}),\, {\varvec{x}} _i - {\varvec{x}} _{i-1} \rangle }\nolimits \le \left( 1 - \frac{1}{i}\right) ^{i-1}\frac{R^2}{i} , \end{aligned}$$

and the proposed accelerated PDHG method satisfies

$$\begin{aligned} \mathop {\langle {\varvec{P}} ({\varvec{x}} _i - {\varvec{y}} _{i-1}),\, {\varvec{x}} _i - {\varvec{y}} _{i-1} \rangle }\nolimits \le \frac{R^2}{i^2} .\end{aligned}$$

Example 6.2

We apply the accelerated PDHG method to the bilinear game problem

$$\begin{aligned} \min _{{\varvec{u}} \in \mathbb {R} ^{d_1}}\max _{{\varvec{v}} \in \mathbb {R} ^{d_2}} \mathop {\langle {\varvec{a}},\, {\varvec{u}} \rangle }\nolimits + \mathop {\langle {\varvec{K}} {\varvec{u}},\, {\varvec{v}} \rangle }\nolimits - \mathop {\langle {\varvec{b}},\, {\varvec{v}} \rangle }\nolimits ,\end{aligned}$$
(42)

where \({\varvec{K}} \in \mathbb {R} ^{d_2\times d_1}\), \({\varvec{a}} \in \mathbb {R} ^{d_1}\) and \({\varvec{b}} \in \mathbb {R} ^{d_2}\). The main part of the corresponding method is as below:

$$\begin{aligned} {\varvec{u}} _{i+1}&= \hat{{\varvec{u}}}_i - \tau ({\varvec{K}} ^*\hat{{\varvec{v}}}_i + {\varvec{a}}) \nonumber \\ {\varvec{v}} _{i+1}&= \hat{{\varvec{v}}}_i + \sigma ({\varvec{K}} (2{\varvec{u}} _{i+1} - \hat{{\varvec{u}}}_i) - {\varvec{b}}). \end{aligned}$$
(43)
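For illustration only, the following sketch combines the PDHG step (43) with the inertia and correction steps of the proposed acceleration applied componentwise to the stacked iterate \(({\varvec{u}},{\varvec{v}})\); all names are ours, and the indexing may differ in minor details from the accelerated PDHG box above:

```python
import numpy as np

def accel_pdhg_bilinear(K, a, b, u0, v0, tau, sigma, num_iters):
    """Hedged sketch: the PDHG step (43) for the bilinear game (42), followed by
    the inertia and correction steps of the proposed acceleration."""
    u_prev, v_prev = np.asarray(u0, float), np.asarray(v0, float)
    uh_prev, vh_prev = u_prev, v_prev          # hat{u}_{-1}, hat{v}_{-1}
    uh, vh = u_prev, v_prev                    # hat{u}_0, hat{v}_0
    for i in range(num_iters):
        u = uh - tau * (K.T @ vh + a)                  # primal step of (43)
        v = vh + sigma * (K @ (2 * u - uh) - b)        # dual step of (43)
        coef = i / (i + 2)
        uh_next = u + coef * (u - u_prev) - coef * (u_prev - uh_prev)
        vh_next = v + coef * (v - v_prev) - coef * (v_prev - vh_prev)
        u_prev, v_prev = u, v
        uh_prev, vh_prev = uh, vh
        uh, vh = uh_next, vh_next
    return u, v
```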

In the experiment, we choose \(d_1 = 1000\), \(d_2 = 500\), and the matrix \({\varvec{K}} \) and the vectors \({\varvec{a}},{\varvec{b}} \) are randomly generated. We run 100 iterations of the PDHG method and its variants with initial \(\hat{{\varvec{u}}}_0 = [10\;\cdots \;10]^\top \), \(\hat{{\varvec{v}}}_0 = [10\;\cdots \;10]^\top \) and \(\tau =\sigma =\frac{0.99}{||{\varvec{K}} ||}\). Figure 4 plots the preconditioned fixed-point residual, where Güler’s first accelerated method diverges. The PDHG method and its proposed accelerated variant are comparable in this experiment, and heuristically restarting the accelerated method every 10 iterations yields a large acceleration. While [11] found restarting (reinitializing) a relaxed PDHG method not useful, our experiment suggests that restarting can be effective in some practical cases.

Fig. 4

Solving a bilinear game problem (42) by a preconditioned proximal point method; the preconditioned fixed-point residual versus iteration. \(\tilde{{\varvec{J}}}_{{\varvec{M}},{\varvec{P}}} {:}{=} ({\varvec{P}} + {\varvec{M}})^{-1}{\varvec{P}} \) denotes the preconditioned resolvent operator

6.3 Accelerating the Douglas–Rachford splitting method

This section considers a monotone inclusion problem in a form

$$\begin{aligned} {\text {Find}}\;\; {\varvec{x}} \in {\mathcal {H}}\quad \text {subject to} \quad {\varvec{0}} \in ({\varvec{M}} _1 + {\varvec{M}} _2){\varvec{x}} \end{aligned}$$
(44)

for \({\varvec{M}} _1,{\varvec{M}} _2\in {\mathcal {M}}({\mathcal {H}})\), where \({\varvec{J}} _{\rho {\varvec{M}} _1}\) and \({\varvec{J}} _{\rho {\varvec{M}} _2}\) are more efficient to evaluate than \({\varvec{J}} _{\rho ({\varvec{M}} _1 + {\varvec{M}} _2)}\) for a positive real number \(\rho \). For such a problem, the Douglas–Rachford splitting method [19, 44] that iteratively applies the operator

$$\begin{aligned} {\varvec{G}} _{\rho ,{\varvec{M}} _1,{\varvec{M}} _2}{:}{=} {\varvec{J}} _{\rho {\varvec{M}} _1}{\circ }(2{\varvec{J}} _{\rho {\varvec{M}} _2} - {\varvec{I}}) + ({\varvec{I}}- {\varvec{J}} _{\rho {\varvec{M}} _2}) \end{aligned}$$
(45)

has been found to be effective in many applications including ADMM, which we discuss in the next section.

In [24, Theorem 4], the Douglas–Rachford operator (45) was found to be a resolvent \({\varvec{J}} _{{\varvec{M}} _{\rho ,{\varvec{M}} _1,{\varvec{M}} _2}}\) of a maximally monotone operator

$$\begin{aligned} {\varvec{M}} _{\rho ,{\varvec{M}} _1,{\varvec{M}} _2} {:}{=} {\varvec{G}} _{\rho ,{\varvec{M}} _1,{\varvec{M}} _2}^{-1} - {\varvec{I}}. \end{aligned}$$
(46)

In other words, the Douglas–Rachford splitting method is an instance of the proximal point method (with \(\lambda = 1\)) as

$$\begin{aligned} {\varvec{\nu }} _{i+1} = {\varvec{J}} _{{\varvec{M}} _{\rho ,{\varvec{M}} _1,{\varvec{M}} _2}}({\varvec{\nu }} _i) = {\varvec{G}} _{\rho ,{\varvec{M}} _1,{\varvec{M}} _2}({\varvec{\nu }} _i) \end{aligned}$$
(47)

for \(i=0,1,\ldots \). Therefore, we can apply the proposed acceleration to the Douglas–Rachford splitting method as below.

[Algorithm: Accelerated Douglas–Rachford splitting method]
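As a hedged sketch (not the paper's algorithm box), the accelerated Douglas–Rachford splitting method can be written with user-supplied resolvents; it follows the explicit form given later in (54):

```python
import numpy as np

def accelerated_drs(J1, J2, nu0, num_iters):
    """Sketch of the accelerated Douglas-Rachford splitting method: one evaluation
    of the DR operator (45) per iteration, via the resolvents J1 = J_{rho M1} and
    J2 = J_{rho M2}, followed by the inertia and correction steps."""
    nu_prev = eta_prev = eta = np.asarray(nu0, dtype=float)   # nu_0 = eta_0 = eta_{-1}
    for i in range(num_iters):
        zeta = J2(eta)                       # J_{rho M2}(eta_i)
        xi = J1(2 * zeta - eta)              # J_{rho M1}(2 zeta_{i+1} - eta_i)
        nu = eta + (xi - zeta)               # nu_{i+1} = G_{rho,M1,M2}(eta_i), cf. (45)
        coef = i / (i + 2)
        eta_next = nu + coef * (nu - nu_prev) - coef * (nu_prev - eta_prev)
        nu_prev, eta_prev, eta = nu, eta, eta_next
    return nu
```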

Using (8) and (27), we have the following worst-case rates for the Douglas–Rachford splitting method and its accelerated variant. Finding exact bounds for the Douglas–Rachford splitting method and its variant is left as future work; [61] used PEP to analyze the exact worst-case rate of the Douglas–Rachford splitting method under some additional conditions.

Corollary 6.2

Assume that \(||{\varvec{\nu }} _0 - {\varvec{\nu }} _*||\le R\) for some \({\varvec{\nu }} _*\in X_*({\varvec{M}} _{\rho ,{\varvec{M}} _1,{\varvec{M}} _2})\). The Douglas–Rachford splitting method satisfies

$$\begin{aligned} ||{\varvec{\nu }} _i - {\varvec{\eta }} _{i-1}||^2 \le \left( 1 - \frac{1}{i}\right) ^{i-1}\frac{R^2}{i} , \end{aligned}$$
(48)

and the proposed accelerated Douglas–Rachford splitting method satisfies

$$\begin{aligned} ||{\varvec{\nu }} _i - {\varvec{\eta }} _{i-1}||^2 \le \frac{R^2}{i^2} . \end{aligned}$$
(49)

[23, 24] showed that ADMM is equivalent to the Douglas–Rachford splitting method applied to the dual problem, so we naturally develop an accelerated ADMM in the next section and provide a numerical experiment for the accelerated ADMM and thus the accelerated Douglas–Rachford splitting method.

6.4 Accelerating the alternating direction method of multipliers (ADMM)

Let \({\mathcal {H}}_1,{\mathcal {H}}_2,{\mathcal {G}}\) be real Hilbert spaces equipped with inner product \(\mathop {\langle \cdot ,\, \cdot \rangle }\nolimits \). This section considers a linearly constrained convex problem

$$\begin{aligned} \min _{{\varvec{x}} \in {\mathcal {H}}_1,{\varvec{z}} \in {\mathcal {H}}_2}\;&f({\varvec{x}}) + g({\varvec{z}}) \nonumber \\ \text {subject to }\;&{\varvec{A}} {\varvec{x}} + {\varvec{B}} {\varvec{z}} = {\varvec{c}}, \end{aligned}$$
(50)

where \(f\in {\mathcal {F}}({\mathcal {H}}_1)\), \(g\in {\mathcal {F}}({\mathcal {H}}_2)\), \({\varvec{A}} \in {\mathcal {B}}({\mathcal {H}}_1,{\mathcal {G}})\), \({\varvec{B}} \in {\mathcal {B}}({\mathcal {H}}_2,{\mathcal {G}})\) and \({\varvec{c}} \in {\mathcal {G}}\). Its dual problem is

$$\begin{aligned} \max _{{\varvec{\nu }} \in {\mathcal {G}}}\;&\left\{ - f^*(-{\varvec{A}} ^*{\varvec{\nu }}) - g^*(-{\varvec{B}} ^*{\varvec{\nu }}) + \mathop {\langle {\varvec{c}},\, {\varvec{\nu }} \rangle }\nolimits \right\} , \end{aligned}$$
(51)

where \(f^*({\varvec{y}}) {:}{=} \sup _{{\varvec{x}} \in {\mathcal {H}}_1} \{\mathop {\langle {\varvec{y}},\, {\varvec{x}} \rangle }\nolimits - f({\varvec{x}})\}\) and \(g^*({\varvec{y}}) {:}{=} \sup _{{\varvec{z}} \in {\mathcal {H}}_2} \{\mathop {\langle {\varvec{y}},\, {\varvec{z}} \rangle }\nolimits - g({\varvec{z}})\}\) are the conjugate functions of f and g, respectively. The dual problem (51) is equivalent to the following monotone inclusion problem

$$\begin{aligned} {\text {Find}}\;\; {\varvec{\nu }} \in {\mathcal {G}}\quad \text {subject to} \quad {\varvec{0}} \in -{\varvec{A}} \partial f^*(-{\varvec{A}} ^*{\varvec{\nu }}) - {\varvec{B}} \partial g^*(-{\varvec{B}} ^*{\varvec{\nu }}) - {\varvec{c}} . \end{aligned}$$
(52)

We next use the connection between ADMM for solving (50) and the Douglas–Rachford splitting method for solving (52), established in [17, Proposition 9] and [60], to develop an accelerated ADMM based on the accelerated Douglas–Rachford splitting method of the previous section.

Denoting

$$\begin{aligned} {\varvec{M}} _1&{:}{=} -{\varvec{A}} \partial f^*(-{\varvec{A}} ^*\cdot ) - {\varvec{c}} \quad \text {and} \quad {\varvec{M}} _2 {:}{=} -{\varvec{B}} \partial g^*(-{\varvec{B}} ^*\cdot ) \end{aligned}$$
(53)

converts problem (52) into an instance of the monotone inclusion problem (44). We then use the following equivalent form of the accelerated Douglas–Rachford splitting method to solve (44) with (53):

$$\begin{aligned} {\varvec{\zeta }} _{i+1}&= {\varvec{J}} _{\rho {\varvec{M}} _2}({\varvec{\eta }} _i) \nonumber \\ {\varvec{\xi }} _{i+1}&= {\varvec{J}} _{\rho {\varvec{M}} _1}(2{\varvec{\zeta }} _{i+1} - {\varvec{\eta }} _i) \nonumber \\ {\varvec{\nu }} _{i+1}&= {\varvec{\eta }} _i + ({\varvec{\xi }} _{i+1} - {\varvec{\zeta }} _{i+1}) \nonumber \\ {\varvec{\eta }} _{i+1}&= {\varvec{\nu }} _{i+1} + \frac{i}{i+2}({\varvec{\nu }} _{i+1} - {\varvec{\nu }} _i) - \frac{i}{i+2}({\varvec{\nu }} _i - {\varvec{\eta }} _{i-1}) \end{aligned}$$
(54)

for \(i=0,1,\ldots \). Replacing the resolvent operators of \({\varvec{M}} _1\) and \({\varvec{M}} _2\) in (53) by minimization steps yields

$$\begin{aligned} {\varvec{z}} _{i+1}&= \mathop {\mathrm{arg}\, \mathrm{min}}\limits _{z\in H_2} \left\{ g({\varvec{z}}) + \mathop {\langle {\varvec{\eta }} _i,\, {\varvec{B}} {\varvec{z}} \rangle }\nolimits + \frac{\rho }{2}||{\varvec{B}} {\varvec{z}} ||^2\right\} \nonumber \\ {\varvec{\zeta }} _{i+1}&= {\varvec{\eta }} _i + \rho {\varvec{B}} {\varvec{z}} _{i+1} \nonumber \\ \tilde{{\varvec{x}}}_{i+1}&= \mathop {\mathrm{arg}\, \mathrm{min}}\limits _{x\in H_1} \left\{ f({\varvec{x}}) + \mathop {\langle {\varvec{\eta }} _i + 2\rho {\varvec{B}} {\varvec{z}} _{i+1},\, {\varvec{A}} {\varvec{x}}- {\varvec{c}} \rangle }\nolimits + \frac{\rho }{2}||{\varvec{A}} {\varvec{x}}- {\varvec{c}} ||^2\right\} \nonumber \\ {\varvec{\xi }} _{i+1}&= {\varvec{\eta }} _i + \rho ({\varvec{A}} \tilde{{\varvec{x}}}_{i+1} - {\varvec{c}}) + 2\rho {\varvec{B}} {\varvec{z}} _{i+1} \nonumber \\ {\varvec{\nu }} _{i+1}&= {\varvec{\eta }} _i + \rho ({\varvec{A}} \tilde{{\varvec{x}}}_{i+1} + {\varvec{B}} {\varvec{z}} _{i+1} - {\varvec{c}}) \nonumber \\ {\varvec{\eta }} _{i+1}&= {\varvec{\nu }} _{i+1} + \frac{i}{i+2}({\varvec{\nu }} _{i+1} - {\varvec{\nu }} _i) - \frac{i}{i+2}({\varvec{\nu }} _i - {\varvec{\eta }} _{i-1}). \end{aligned}$$
(55)

By discarding \({\varvec{\zeta }} _i\) and \({\varvec{\xi }} _i\), and defining

$$\begin{aligned} \hat{{\varvec{\nu }}}_i {:}{=} {\varvec{\nu }} _i - \rho ({\varvec{A}} \tilde{{\varvec{x}}}_i - {\varvec{c}}) \quad \text {and}\quad \hat{{\varvec{\eta }}}_i {:}{=} {\varvec{\eta }} _i - \rho ({\varvec{A}} \tilde{{\varvec{x}}}_i - {\varvec{c}}), \end{aligned}$$
(56)

for \(i=0,1,\ldots \), we have

$$\begin{aligned} {\varvec{z}} _{i+1}&= \mathop {\mathrm{arg}\, \mathrm{min}}\limits _{{z\in {\mathcal {H}}_2}} \left\{ g({\varvec{z}}) + \mathop {\langle \hat{{\varvec{\eta }}}_i + \rho ({\varvec{A}} \tilde{{\varvec{x}}}_i - {\varvec{c}}),\, {\varvec{B}} {\varvec{z}} \rangle }\nolimits + \frac{\rho }{2}||{\varvec{B}} {\varvec{z}} ||^2\right\} \nonumber \\&= \mathop {\mathrm{arg}\, \mathrm{min}}\limits _{z\in {\mathcal {H}}_2} \left\{ g({\varvec{z}}) + \mathop {\langle \hat{{\varvec{\eta }}}_i,\, {\varvec{A}} \tilde{{\varvec{x}}}_i + {\varvec{B}} {\varvec{z}}- {\varvec{c}} \rangle }\nolimits + \frac{\rho }{2}||{\varvec{A}} \tilde{{\varvec{x}}}_i + {\varvec{B}} {\varvec{z}}- {\varvec{c}} ||^2\right\} \nonumber \\ \tilde{{\varvec{x}}}_{i+1}&= \mathop {\mathrm{arg}\, \mathrm{min}}\limits _{x\in {\mathcal {H}}_1} \left\{ f({\varvec{x}}) + \mathop {\langle \hat{{\varvec{\nu }}}_{i+1} + \rho {\varvec{B}} {\varvec{z}} _{i+1},\, {\varvec{A}} {\varvec{x}}- {\varvec{c}} \rangle }\nolimits + \frac{\rho }{2}||{\varvec{A}} {\varvec{x}}- {\varvec{c}} ||^2\right\} \nonumber \\&= \mathop {\mathrm{arg}\, \mathrm{min}}\limits _{x\in {\mathcal {H}}_1} \left\{ f({\varvec{x}}) + \mathop {\langle \hat{{\varvec{\nu }}}_{i+1},\, {\varvec{A}} {\varvec{x}} + {\varvec{B}} {\varvec{z}} _{i+1} - {\varvec{c}} \rangle }\nolimits + \frac{\rho }{2}||{\varvec{A}} {\varvec{x}} + {\varvec{B}} {\varvec{z}} _{i+1} - {\varvec{c}} ||^2\right\} \nonumber \\ \hat{{\varvec{\nu }}}_{i+1}&= \hat{{\varvec{\eta }}}_i + \rho ({\varvec{A}} \tilde{{\varvec{x}}}_i + {\varvec{B}} {\varvec{z}} _{i+1} - {\varvec{c}}) \nonumber \\ \hat{{\varvec{\eta }}}_{i+1}&= \hat{{\varvec{\nu }}}_{i+1} + \frac{i}{i+2}(\hat{{\varvec{\nu }}}_{i+1} - \hat{{\varvec{\nu }}}_i + \rho {\varvec{A}} (\tilde{{\varvec{x}}}_{i+1} - \tilde{{\varvec{x}}}_i)) \nonumber \\&\quad - \frac{i}{i+2}(\hat{{\varvec{\nu }}}_i - \hat{{\varvec{\eta }}}_{i-1} + \rho {\varvec{A}} (\tilde{{\varvec{x}}}_i - \tilde{{\varvec{x}}}_{i-1})) . \end{aligned}$$
(57)

Then, replacing \(\tilde{{\varvec{x}}}_i\) by \({\varvec{x}} _{i+1}\) and reordering the steps appropriately yields the following accelerated version of ADMM, which reduces to the standard ADMM when \(\hat{{\varvec{\eta }}}_i = \hat{{\varvec{\nu }}}_i\) for \(i=0,1,\ldots \).

[Algorithm figure omitted: accelerated alternating direction method of multipliers (ADMM).]
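As an illustrative sketch only (not the exact pseudocode of the omitted figure), the following Python function carries out the iteration (57) with \(\tilde{{\varvec{x}}}_i\) renamed \({\varvec{x}} _{i+1}\); the subproblem solvers xmin and zmin are hypothetical user-supplied oracles, and the matrices \({\varvec{A}}\), \({\varvec{B}}\) and vector \({\varvec{c}}\) are assumed to be given explicitly as arrays.

```python
def accelerated_admm(xmin, zmin, A, B, c, rho, x0, z0, nu0, n_iter):
    """Sketch of an accelerated ADMM following (57) (with tilde-x_i renamed x_{i+1}).

    Hypothetical user-supplied oracles:
      xmin(v, z): argmin_x f(x) + <v, A x + B z - c> + (rho/2)||A x + B z - c||^2
      zmin(v, x): argmin_z g(z) + <v, A x + B z - c> + (rho/2)||A x + B z - c||^2
    A, B are matrices (NumPy arrays) and c is a vector.
    """
    x, x_prev = x0.copy(), x0.copy()              # x_i, x_{i-1}
    z = z0.copy()                                 # z_i
    nu, nu_prev = nu0.copy(), nu0.copy()          # hat-nu_i, hat-nu_{i-1}
    eta_prev, eta_prev2 = nu0.copy(), nu0.copy()  # hat-eta_{i-1}, hat-eta_{i-2}
    for i in range(n_iter):
        x_next = xmin(nu, z)                      # x_{i+1}
        if i <= 1:
            eta = nu.copy()                       # hat-eta_i = hat-nu_i for i = 0, 1
        else:
            beta = (i - 1) / (i + 1)
            eta = (nu + beta * (nu - nu_prev + rho * (A @ (x_next - x)))
                      - beta * (nu_prev - eta_prev2 + rho * (A @ (x - x_prev))))
        z = zmin(eta, x_next)                     # z_{i+1}
        nu_next = eta + rho * (A @ x_next + B @ z - c)   # hat-nu_{i+1}
        x_prev, x = x, x_next
        nu_prev, nu = nu, nu_next
        eta_prev2, eta_prev = eta_prev, eta
    return x, z, nu
```

Skipping the momentum branch, i.e., taking \(\hat{{\varvec{\eta }}}_i = \hat{{\varvec{\nu }}}_i\) for all i, recovers the standard ADMM, as noted above.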

Since

$$\begin{aligned} {\varvec{\nu }} _i - {\varvec{\eta }} _{i-1} = \hat{{\varvec{\nu }}}_i - \hat{{\varvec{\eta }}}_{i-1} + \rho ({\varvec{A}} {\varvec{x}} _{i+1} - {\varvec{A}} {\varvec{x}} _i) = \rho ({\varvec{A}} {\varvec{x}} _{i+1} + {\varvec{B}} {\varvec{z}} _i - {\varvec{c}}), \end{aligned}$$
(58)

where the first equality follows from (56) and the second from the \(\hat{{\varvec{\nu }}}\)-update in (57) (after the renaming above), we have the following worst-case rates with respect to infeasibility for ADMM and its accelerated version, using (8) and (27).

Corollary 6.3

Assume that \(||\hat{{\varvec{\nu }}}_0 + \rho ({\varvec{A}} {\varvec{x}} _0 - {\varvec{c}}) - {\varvec{\nu }} _*|| \le R\) for some \({\varvec{\nu }} _*\in X_*({\varvec{M}} _{\rho ,-{\varvec{A}} \partial f^*(-{\varvec{A}} ^*\cdot ) - {\varvec{c}},-{\varvec{B}} \partial g^*(-{\varvec{B}} ^*\cdot )})\). The alternating direction method of multipliers satisfies

$$\begin{aligned} ||{\varvec{A}} {\varvec{x}} _{i+1} + {\varvec{B}} {\varvec{z}} _i - {\varvec{c}} ||^2 \le \left( 1 - \frac{1}{i}\right) ^{i-1}\frac{R^2}{\rho ^2i} ,\end{aligned}$$
(59)

and the proposed accelerated alternating direction method of multipliers satisfies

$$\begin{aligned} ||{\varvec{A}} {\varvec{x}} _{i+1} + {\varvec{B}} {\varvec{z}} _i - {\varvec{c}} ||^2 \le \frac{R^2}{\rho ^2i^2} . \end{aligned}$$
(60)

The bound (59) is asymptotically e-times smaller than the known rate for ADMM in [17, Theorem 15], which originates from the bound (7), since \(\left( 1 - \frac{1}{i}\right) ^{i-1}\rightarrow e^{-1}\) as \(i\rightarrow \infty \). Finding exact bounds for ADMM and its proposed variant is left as future work.

Remark 6.1

Many existing rates for the (preconditioned) ADMM consider the ergodic sequences \(\{{\bar{{\varvec{x}}}}_i\}\) and \(\{{\bar{{\varvec{z}}}}_i\}\), where \({\bar{{\varvec{x}}}}_i {:}{=} \frac{1}{i}\sum _{l=1}^i{\varvec{x}} _l\) and \({\bar{{\varvec{z}}}}_i {:}{=} \frac{1}{i}\sum _{l=1}^i{\varvec{z}} _l\) (see e.g., [11, 13, 17, 18]). In particular, in [17, Theorem 15], ADMM is found to satisfy

$$\begin{aligned} ||{\varvec{A}} {\bar{{\varvec{x}}}}_{i+1} + {\varvec{B}} {\bar{{\varvec{z}}}}_i - {\varvec{c}} ||^2 \le \frac{16R^2}{\rho ^2i^2} , \end{aligned}$$
(61)

which is faster than the rate of the nonergodic sequence \(\{{\varvec{x}} _i,{\varvec{z}} _i\}\) of ADMM in (59) and is comparable to the rate of the proposed accelerated ADMM in (60). One should note that the feasibility convergence of the ergodic sequence, as in (61), does not necessarily imply the convergence of the fixed-point residual of the ergodic sequence, unlike (59) and (60) for the nonergodic sequence. In addition, some numerical experiments in [13] illustrate that the nonergodic sequence can converge faster than the ergodic sequence. We leave a further understanding of the rates of the ergodic and nonergodic sequences of (preconditioned) ADMM, and of their relationship, as future work.

Remark 6.2

Accelerated variants of (preconditioned) ADMM were proposed in [11, 13, 28] under some additional conditions, while the proposed method does not require such conditions.

Example 6.3

We apply the accelerated ADMM to the problem

$$\begin{aligned} \min _{{\varvec{x}} \in \mathbb {R} ^{d_1},{\varvec{z}} \in \mathbb {R} ^{d_2}}\;&\frac{1}{2}||{\varvec{H}} {\varvec{x}}- {\varvec{b}} ||^2 + \gamma ||{\varvec{z}} ||_1 \nonumber \\ \text {subject to }\;&{\varvec{D}} {\varvec{x}}- {\varvec{z}} = {\varvec{0}}, \end{aligned}$$
(62)

with a positive real number \(\gamma \), associated with the total-variation-regularized least-squares problem

$$\begin{aligned} \min _{{\varvec{x}} \in \mathbb {R} ^{d_1}}\;&\frac{1}{2}||{\varvec{H}} {\varvec{x}}- {\varvec{b}} ||^2 + \gamma ||{\varvec{D}} {\varvec{x}} ||_1 , \end{aligned}$$
(63)

where \({\varvec{H}} \in \mathbb {R} ^{p\times d_1}\), \({\varvec{b}} \in \mathbb {R} ^p\), and a matrix \({\varvec{D}} \in \mathbb {R} ^{d_2\times d_1}\) is given as

$$\begin{aligned} {\varvec{D}} = \left[ \begin{array}{cccccc} 1 &{} \quad -1 &{} \quad 0 &{} \quad 0 &{} \quad \cdots &{} \quad 0 \\ 0 &{} \quad 1 &{} \quad -1 &{} \quad 0 &{} \quad \cdots &{} \quad 0 \\ \vdots &{} \quad \ddots &{} \quad \ddots &{} \quad \ddots &{} \quad \ddots &{} \quad \vdots \\ \vdots &{} \quad \ddots &{} \quad 0 &{} \quad 1 &{} \quad -1 &{} \quad 0 \\ 0 &{} \quad \cdots &{} \quad \cdots &{} \quad 0 &{} \quad 1 &{} \quad -1 \end{array}\right] .\end{aligned}$$
(64)
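For illustration, the matrix (64) can be generated as follows (a small NumPy sketch; the function name is ours).

```python
import numpy as np

def first_difference_matrix(d1: int) -> np.ndarray:
    """Return the (d1 - 1) x d1 first-difference matrix D of (64)."""
    D = np.zeros((d1 - 1, d1))
    idx = np.arange(d1 - 1)
    D[idx, idx] = 1.0       # main diagonal entries 1
    D[idx, idx + 1] = -1.0  # first superdiagonal entries -1
    return D
```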

By letting \(f({\varvec{x}}) = \frac{1}{2}||{\varvec{H}} {\varvec{x}}- {\varvec{b}} ||^2\), \(g({\varvec{z}}) = \gamma ||{\varvec{z}} ||_1\), \({\varvec{A}} = {\varvec{D}} \), \({\varvec{B}} = -{\varvec{I}} \) and \({\varvec{c}} = {\varvec{0}} \), we have the following accelerated ADMM method:

$$\begin{aligned} {\varvec{x}} _{i+1}&= \mathop {\text {arg min}}\limits _{x\in R^{d_1}} \left\{ \frac{1}{2}||{\varvec{H}} {\varvec{x}}- {\varvec{b}} ||^2 + \mathop {\langle \hat{{\varvec{\nu }}}_i,\, {\varvec{D}} {\varvec{x}}- {\varvec{z}} _i \rangle }\nolimits + \frac{\rho }{2}||{\varvec{D}} {\varvec{x}}- {\varvec{z}} _i||^2\right\} \nonumber \\&= ({\varvec{H}} ^\top {\varvec{H}} + \rho {\varvec{D}} ^\top {\varvec{D}})^{-1}({\varvec{D}} ^\top (\rho {\varvec{z}} _i - \hat{{\varvec{\nu }}}_i) + {\varvec{H}} ^\top {\varvec{b}}) \nonumber \\ \hat{{\varvec{\eta }}}_i&= {\left\{ \begin{array}{ll} \hat{{\varvec{\nu }}}_i, &{} i=0,1, \\ \hat{{\varvec{\nu }}}_i + \frac{i-1}{i+1}(\hat{{\varvec{\nu }}}_i - \hat{{\varvec{\nu }}}_{i-1} + \rho {\varvec{D}} ({\varvec{x}} _{i+1} - {\varvec{x}} _i)) - \frac{i-1}{i+1}(\hat{{\varvec{\nu }}}_{i-1} - \hat{{\varvec{\eta }}}_{i-2} + \rho {\varvec{D}} ({\varvec{x}} _i - {\varvec{x}} _{i-1})), &{} i=2,3,\ldots \end{array}\right. } \nonumber \\ {\varvec{z}} _{i+1}&= \mathop {\text {arg min}}\limits _{z\in R^{d_2}} \left\{ \gamma ||{\varvec{z}} ||_1 + \mathop {\langle \hat{{\varvec{\eta }}}_i,\, {\varvec{D}} {\varvec{x}} _{i+1} - {\varvec{z}} \rangle }\nolimits + \frac{\rho }{2}||{\varvec{D}} {\varvec{x}} _{i+1} - {\varvec{z}} ||^2\right\} \nonumber \\&= {\text {S}}_{\frac{\gamma }{\rho }}\left( {\varvec{D}} {\varvec{x}} _{i+1} + \frac{1}{\rho }\hat{{\varvec{\eta }}}_i\right) \nonumber \\ \hat{{\varvec{\nu }}}_{i+1}&= \hat{{\varvec{\eta }}}_i + \rho ({\varvec{D}} {\varvec{x}} _{i+1} - {\varvec{z}} _{i+1}), \end{aligned}$$
(65)

where the soft-thresholding operator is defined as \({\text {S}}_{\tau }({\varvec{z}}) {:}{=} \max \{|{\varvec{z}} |-\tau ,{\varvec{0}} \} \odot {\text {sign}}({\varvec{z}})\) with the element-wise absolute value, maximum and multiplication operators, \(|\cdot |\), \(\max \{\cdot ,\cdot \}\) and \(\odot \), respectively.
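As a concrete illustration, a minimal NumPy sketch of the iteration (65) follows; the zero initialization, the fixed iteration count, and the function names are our own illustrative choices, not specified in the text.

```python
import numpy as np

def soft_threshold(v, tau):
    """Element-wise soft-thresholding S_tau(v) = max(|v| - tau, 0) * sign(v)."""
    return np.maximum(np.abs(v) - tau, 0.0) * np.sign(v)

def accelerated_admm_tv(H, b, D, gamma, rho, n_iter):
    """Sketch of the accelerated ADMM (65) for the TV-regularized LS problem (62)-(63)."""
    d1, d2 = H.shape[1], D.shape[0]
    x, x_prev = np.zeros(d1), np.zeros(d1)            # x_i, x_{i-1}
    z = np.zeros(d2)                                  # z_i
    nu, nu_prev = np.zeros(d2), np.zeros(d2)          # hat-nu_i, hat-nu_{i-1}
    eta_prev, eta_prev2 = np.zeros(d2), np.zeros(d2)  # hat-eta_{i-1}, hat-eta_{i-2}
    # the x-update solves a linear system with a fixed matrix
    K = H.T @ H + rho * D.T @ D
    Hb = H.T @ b
    for i in range(n_iter):
        x_next = np.linalg.solve(K, D.T @ (rho * z - nu) + Hb)   # x_{i+1}
        if i <= 1:
            eta = nu.copy()                                      # hat-eta_i = hat-nu_i
        else:
            beta = (i - 1) / (i + 1)
            eta = (nu + beta * (nu - nu_prev + rho * D @ (x_next - x))
                      - beta * (nu_prev - eta_prev2 + rho * D @ (x - x_prev)))
        z = soft_threshold(D @ x_next + eta / rho, gamma / rho)  # z_{i+1}
        nu_next = eta + rho * (D @ x_next - z)                   # hat-nu_{i+1}
        x_prev, x = x, x_next
        nu_prev, nu = nu, nu_next
        eta_prev2, eta_prev = eta_prev, eta
    return x, z
```

This routine can be combined with the first_difference_matrix sketch above and the data-generation choices described in the experiment below.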

In the experiment, we choose \(d_1 = 100\), \(d_2 = 99\), \(p=5\), and the true vector \({\varvec{x}} _{\mathrm {true}}\) is constructed such that \({\varvec{D}} {\varvec{x}} _{\mathrm {true}}\) has few nonzero elements. The matrix \({\varvec{H}} \) is randomly generated, and the noisy vector \({\varvec{b}} \) is generated by adding a randomly generated noise vector to \({\varvec{H}} {\varvec{x}} _{\mathrm {true}}\). We choose the parameters \(\gamma = 3\) and \(\rho = 0.05\) in the experiment.

Figure 5 illustrates the fixed-point residual of ADMM and its accelerated variants. Interestingly, ADMM has a rate comparable to the \(O(1/i^2)\) rate of the proposed method. This does not contradict the theory; we leave further investigation of the worst-case rate of ADMM under the Lipschitz continuity condition on \(\nabla f\) as future work, and note that similar analyses under different conditions can be found in [17, 18]. Noticing the oscillating behavior of the proposed accelerated ADMM in Fig. 5, we heuristically restarted the proposed method every 20 iterations, which yielded a linear rate without a strong monotonicity condition. Restarting has previously been found useful for a different accelerated ADMM in [28].

[Fig. 5 omitted: solving a total-variation-regularized least-squares problem (63); the fixed-point residual versus iteration.]

7 Accelerated forward method for cocoercive operators

This section applies the proposed acceleration to a forward method, such as a gradient method, for cocoercive operators. A single-valued operator \({\varvec{M}} \;:\;{\mathcal {H}}\rightarrow {\mathcal {H}}\) is \(\beta \)-cocoercive for \(\beta \in \mathbb {R} _{++}\) if

$$\begin{aligned} \mathop {\langle {\varvec{x}}- {\varvec{y}},\, {\varvec{M}} {\varvec{x}}- {\varvec{M}} {\varvec{y}} \rangle }\nolimits \ge \beta ||{\varvec{M}} {\varvec{x}}- {\varvec{M}} {\varvec{y}} ||^2\text { for all } {\varvec{x}},{\varvec{y}} \in {\mathcal {H}}. \end{aligned}$$
(66)

Let \({\mathcal {C}}_{\beta }({\mathcal {H}})\) be the class of \(\beta \)-cocoercive operators on \({\mathcal {H}}\). For example, by the Baillon–Haddad theorem, the gradient of a convex function with an \(L\)-Lipschitz continuous gradient is \(1/L\)-cocoercive. For a \(\beta \)-cocoercive operator, the following forward method (which iteratively applies the forward operator \({\varvec{I}}- \beta {\varvec{M}} \)) is guaranteed to converge weakly to a solution [7, Theorem 26.14].

[Algorithm figure omitted: forward method for cocoercive operators.]

An operator \({\varvec{T}} \;:\;{\mathcal {H}}\rightarrow {\mathcal {H}}\) is \(\lambda \)-cocoercive if and only if it is the Yosida approximation of index \(\lambda \) [7, Proposition 23.21]:

$$\begin{aligned} {\varvec{M}} _\lambda {:}{=} \frac{1}{\lambda }({\varvec{I}}- {\varvec{J}} _{\lambda {\varvec{M}}}) \end{aligned}$$
(67)

of a maximally monotone operator \({\varvec{M}} \;:\;{\mathcal {H}}\rightarrow 2^{\mathcal {H}}\). We thus have the following equivalence between the resolvent (backward) operator of a maximally monotone operator \({\varvec{M}} \) and a forward operator of the corresponding cocoercive operator \({\varvec{M}} _\lambda \):

$$\begin{aligned} {\varvec{J}} _{\lambda {\varvec{M}}} = ({\varvec{I}} + \lambda {\varvec{M}})^{-1} = {\varvec{I}}- \lambda {\varvec{M}} _\lambda .\end{aligned}$$
(68)

Therefore, the results on the proximal point method and its accelerated variant for monotone operators directly apply to the forward method and its accelerated variant for cocoercive operators.
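For concreteness, a minimal sketch of the resulting accelerated forward iteration is given below, obtained by combining the forward operator \({\varvec{I}}- \beta {\varvec{M}} \) with the momentum recursion used in (54) via the correspondence (68); the \(\beta \)-cocoercive operator is assumed to be supplied as a callable M (hypothetical name), and x0 as an array.

```python
def accelerated_forward(M, beta, x0, n_iter):
    """Sketch of the accelerated forward method for a beta-cocoercive operator M.

    The forward step (I - beta M) plays the role of the resolvent via
    J_{lambda M} = I - lambda M_lambda in (68); M is a callable returning M(y).
    """
    x_prev = x0.copy()          # x_i
    y = x0.copy()               # y_i (extrapolated point)
    y_prev = x0.copy()          # y_{i-1}
    for i in range(n_iter):
        x = y - beta * M(y)     # forward step at the extrapolated point
        # momentum recursion with coefficients i/(i+2), cf. (54)
        y_next = x + (i / (i + 2)) * (x - x_prev) - (i / (i + 2)) * (x_prev - y_prev)
        x_prev, y_prev, y = x, y, y_next
    return x_prev
```

Setting the momentum terms to zero recovers the plain forward method above.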

8 Conclusion

This paper developed an accelerated proximal point method for maximally monotone operators, with respect to the fixed-point residual, using the computer-assisted performance estimation problem approach. A restarting technique was further employed under the strong monotonicity condition. The proposed acceleration was applied to various instances of the proximal point method, such as the proximal method of multipliers, the primal–dual hybrid gradient method, the Douglas–Rachford splitting method, and the alternating direction method of multipliers, yielding accelerations both theoretically and practically. The acceleration was also applied to a forward method for cocoercive operators.

We leave developing accelerations for more general or more specific classes of problems or methods as future work, possibly via the performance estimation problem approach; a comprehensive understanding of accelerations for the alternating direction method of multipliers, with respect to various performance measures and under various conditions, remains open.