Abstract
We develop new perturbation techniques for conducting convergence analysis of various first-order algorithms for a class of nonsmooth optimization problems. We use the iteration scheme of an algorithm to construct a perturbed stationary point set-valued map, and define the perturbing parameter by the difference of two consecutive iterates. We then show that the calmness condition of the induced set-valued map, together with a local version of the proper separation of stationary value condition, is sufficient to ensure the linear convergence of the algorithm. The equivalence of this calmness condition to the one for the canonically perturbed stationary point set-valued map is proved, and this equivalence allows us to derive sufficient conditions for calmness by using some recent developments in variational analysis. These sufficient conditions differ from existing results (especially the error-bound-based ones) in that they can be easily verified for many concrete application models. Our analysis focuses on the fundamental proximal gradient (PG) method, and it enables us to show that any accumulation point of the sequence generated by the PG method must be a stationary point in terms of the proximal subdifferential, instead of the limiting subdifferential. This reveals the surprising fact that the quality of the solutions found by the PG method is in general superior. Our analysis also leads to improvements of the linear convergence results for the PG method in the convex case. The new perturbation technique can be conveniently used to derive the linear convergence of a number of other first-order methods, including the well-known alternating direction method of multipliers and the primal-dual hybrid gradient method, under mild assumptions.
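To make the abstract objects concrete, the following sketch (not from the paper; a hypothetical LASSO instance with \(f(x)=\frac12\|Ax-b\|^2\) and \(g(x)=\lambda\|x\|_1\)) runs the PG iteration \(x^{k+1} = \text{Prox}_{g}^{\gamma}(x^{k} - \gamma \nabla f(x^{k}))\) and monitors the perturbing parameter \(\|x^{k+1}-x^{k}\|\), which shrinks toward zero along the iterates:

```python
import numpy as np

# Hypothetical LASSO instance: f(x) = 0.5*||Ax - b||^2, g(x) = lam*||x||_1.
def soft_threshold(v, t):
    # prox of t*||.||_1: componentwise soft shrinkage
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_gradient(A, b, lam, gamma, iters=3000):
    """PG iteration x^{k+1} = Prox_{gamma*g}(x^k - gamma*grad f(x^k))."""
    x = np.zeros(A.shape[1])
    steps = []
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                 # grad f(x)
        x_new = soft_threshold(x - gamma * grad, gamma * lam)
        steps.append(np.linalg.norm(x_new - x))  # the perturbing parameter
        x = x_new
    return x, steps

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
L = np.linalg.norm(A, 2) ** 2                    # Lipschitz constant of grad f
x_star, steps = prox_gradient(A, b, lam=0.5, gamma=0.9 / L)
print(steps[0], steps[-1])                       # step lengths shrink toward zero
```

The quantity `steps[k]` is exactly the perturbing parameter whose behavior, via the calmness analysis sketched in the abstract, governs the linear rate.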
References
Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010)
Attouch, H., Bolte, J., Svaiter, B.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Program. 137(1), 91–129 (2013)
Aubin, J.: Lipschitz behavior of solutions to convex minimization problems. Math. Oper. Res. 9(1), 87–111 (1984)
Bai, K., Ye, J. J., Zhang, J.: Directional quasi-/pseudo-normality as sufficient conditions for metric subregularity. SIAM J. Optim. 29(4), 2625–2649 (2019)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Bello-Cruz, J., Li, G., Nghia, T. T. A.: On the Q-linear convergence of forward-backward splitting method and uniqueness of optimal solution to lasso. arXiv:1806.06333 (2018)
Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007)
Bolte, J, Daniilidis, A., Ley, O., Mazet, L.: Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity. Trans. Amer. Math. Soc. 362(6), 3319–3363 (2010)
Bolte, J., Nguyen, T. P., Peypouquet, J., Suter, B. W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. 165, 471–507 (2017)
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1), 459–494 (2014)
Cauchy, A. L.: Méthode générale pour la résolution des systèmes d’équations simultanées. Comptes Rendus de l’Académie des Sciences 25, 46–89 (1847)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40, 120–145 (2011)
Clarke, F., Ledyaev, Y., Stern, R., Wolenski, P.: Nonsmooth Analysis and Control Theory, vol. 178. Springer Science & Business Media (2008)
Clarke, F., Stern, R., Wolenski, P.: Subgradient criteria for monotonicity, the Lipschitz condition, and convexity. Can. J. Math. 45, 1167–1183 (1993)
Dontchev, A., Rockafellar, R. T.: Implicit Functions and Solution Mappings. Springer Monographs in Mathematics (2009)
Drusvyatskiy, D., Ioffe, A., Lewis, A.: Nonsmooth optimization using Taylor-like models: error bounds, convergence, and termination criteria. Math. Program., 1–27 (2019)
Drusvyatskiy, D., Lewis, A.: Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res. 43(3), 919–948 (2018)
Drusvyatskiy, D., Mordukhovich, B. S., Nghia, T. T. A.: Second-order growth, tilt stability, and metric regularity of the subdifferential. J. Convex Anal. 21(4), 1165–1192 (2014)
Fan, J. Q., Li, R. Z.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Stat. Assoc. 96(456), 1348–1360 (2001)
Frankel, P., Garrigos, G., Peypouquet, J.: Splitting methods with variable metric for Kurdyka-Łojasiewicz functions and general convergence rates. J. Optim. Theory Appl. 165(3), 874–900 (2015)
Gfrerer, H.: On directional metric regularity, subregularity and optimality conditions for nonsmooth mathematical programs. Set-Valued Var. Anal. 21(2), 151–176 (2013)
Gfrerer, H.: Optimality conditions for disjunctive programs based on generalized differentiation with application to mathematical programs with equilibrium constraints. SIAM J. Optim. 24(2), 898–931 (2014)
Gfrerer, H., Klatte, D.: Lipschitz and Hölder stability of optimization problems and generalized equations. Math. Program. 158(1), 35–75 (2016)
Gfrerer, H., Outrata, J.: On Lipschitzian properties of implicit multifunctions. SIAM J. Optim. 26(4), 2160–2189 (2016)
Gfrerer, H., Ye, J. J.: New constraint qualifications for mathematical programs with equilibrium constraints via variational analysis. SIAM J. Optim. 27 (2), 842–865 (2017)
Ginchev, I., Mordukhovich, B. S.: On directionally dependent subdifferentials. Comptes rendus de L’Academie Bulgare des Sciences 64, 497–508 (2011)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
He, B., Fu, X., Jiang, Z.: Proximal-point algorithm using a linear proximal term. J. Optim. Theory Appl. 141(2), 299–319 (2009)
He, B., Yuan, X.: Convergence analysis of primal-dual algorithms for a saddle-point problem: from contraction perspective. SIAM J. Imaging Sci. 5(1), 119–149 (2012)
Henrion, R., Jourani, A., Outrata, J.: On the calmness of a class of multifunctions. SIAM J. Optim. 13(2), 603–618 (2002)
Henrion, R., Outrata, J.: Calmness of constraint systems with applications. Math. Program. 104(2–3), 437–464 (2005)
Hoffman, A. J.: On approximate solutions of systems of linear inequalities. J. Res. Natl. Bur. Stand. 49(4), 263–265 (1952)
Klatte, D., Kummer, B.: Constrained minima and Lipschitzian penalties in metric spaces. SIAM J. Optim. 13(2), 619–633 (2002)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
Li, G., Pong, T. -K.: Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. 18, 1199–1232 (2018)
Liu, Y., Yuan, X., Zeng, S., Zhang, J.: Partial error bound conditions and the linear convergence rate of alternating direction method of multipliers. SIAM J. Numer. Anal. 56(4), 2095–2123 (2018)
Luo, Z. -Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46(1), 157–178 (1993)
Luo, Z. -Q., Tseng, P.: Error bound and convergence analysis of matrix splitting algorithms for the affine variational inequality problem. SIAM J. Optim. 2(1), 43–54 (1992)
Mordukhovich, B. S.: Variational Analysis and Generalized Differentiation I: Basic Theory, II: Applications. Springer Science & Business Media (2006)
Nesterov, Y.: Introductory Lectures on Convex Optimization: a Basic Course, vol. 87. Springer Science & Business Media (2013)
Noll, D.: Convergence of non-smooth descent methods using the Kurdyka–Łojasiewicz inequality. J. Optim. Theory Appl. 160(2), 553–572 (2014)
O’Donoghue, B., Candès, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2015)
Pan, S., Liu, Y. (2019)
Passty, G.: Ergodic convergence to a zero of the sum of monotone operators in Hilbert space. J Math. Anal. Appl. 72(2), 383–390 (1979)
Polyak, B. T.: Introduction to Optimization. Optimization Software Incorporation, New York (1987)
Reed, R.: Pruning algorithms - a survey. IEEE Trans. Neural Netw. 4(5), 740–747 (1993)
Robinson, S.: Stability theory for systems of inequalities. Part i: linear systems. SIAM J. Numer. Anal. 12(5), 754–769 (1975)
Robinson, S.: Strongly regular generalized equations. Math. Oper. Res. 5(1), 43–62 (1980)
Robinson, S.: Some continuity properties of polyhedral multifunctions. Math. Program. Stud. 14, 206–214 (1981)
Rockafellar, R. T., Wets, R.: Variational Analysis. Springer Science & Business Media (2009)
Schmidt, M., Roux, N., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Advances in Neural Information Processing Systems, vol. 24, pp 1458–1466 (2011)
Tao, S., Boley, D., Zhang, S.: Local linear convergence of ISTA and FISTA on the LASSO problem. SIAM J. Optim. 26(1), 313–336 (2016)
Tseng, P.: Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program. 125(2), 263–295 (2010)
Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Math. Program. 117(1), 387–423 (2009)
Wen, B., Chen, X., Pong, T. -K.: Linear convergence of proximal gradient algorithm with extrapolation for a class of nonconvex nonsmooth minimization problems. SIAM J. Optim. 27(1), 124–145 (2017)
Xiao, L., Zhang, T.: A proximal-gradient homotopy method for the sparse least-squares problem. SIAM J. Optim. 23(2), 1062–1091 (2013)
Yang, W., Han, D.: Linear convergence of the alternating direction method of multipliers for a class of convex optimization problems. SIAM J. Numer. Anal. 54, 625–640 (2016)
Ye, J. J.: Constraint qualifications and necessary optimality conditions for optimization problems with variational inequality constraints. SIAM J. Optim. 10(4), 943–962 (2000)
Ye, J. J., Ye, X. Y.: Necessary optimality conditions for optimization problems with variational inequality constraints. Math. Oper. Res. 22(4), 977–997 (1997)
Ye, J. J., Yuan, X., Zeng, S., Zhang, J.: Variational analysis perspective on linear convergence of some first order methods for nonsmooth convex optimization problems. Optimization-online preprint (2018)
Ye, J. J., Zhou, J.: Verifiable sufficient conditions for the error bound property of second-order cone complementarity problems. Math. Program. 171, 361–395 (2018)
Yuan, X., Zeng, S., Zhang, J.: Discerning the linear convergence of ADMM for structured convex optimization through the lens of variational analysis. J. Mach. Learn. Res. 21, 1–75 (2020)
Zhang, C. -H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)
Zhou, Z., So, A. M. -C.: A unified approach to error bounds for structured convex optimization problems. Math. Program. 165(2), 689–728 (2017)
Acknowledgments
The authors would like to thank the anonymous referees for their helpful suggestions and comments.
The research was partially supported by NSFC Nos. 11871279 and 11971090, NSERC, the General Research Fund from the Hong Kong Research Grants Council (No. 12302318), National Science Foundation of China No. 11971220, and Guangdong Basic and Applied Basic Research Foundation No. 2019A1515011152.
Appendix A
A.1 Proof of Lemma 2
(1) Since \(x^{k+1}\) is the optimal solution of the proximal operation (1.3) with \(a = x^{k} - \gamma \nabla f(x^{k})\), we have
which can be reformulated as
Furthermore, since ∇f(x) is globally Lipschitz continuous with the Lipschitz constant L, we have
Adding the above inequality to (A.1) we obtain
As a result, if \(\gamma < \frac {1}{L}\), then (3.1) holds with \(\kappa _{1} := \frac {1}{2\gamma } - \frac {L}{2}\).
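The resulting sufficient descent property can be checked numerically. The sketch below (a hypothetical LASSO instance, not from the paper) verifies along PG iterates that \(F(x^{k+1}) \le F(x^{k}) - \kappa_{1}\|x^{k+1}-x^{k}\|^{2}\) with \(\kappa_{1} = \frac{1}{2\gamma}-\frac{L}{2}\) whenever \(\gamma < 1/L\):

```python
import numpy as np

# Hypothetical test instance: F = f + g with f(x) = 0.5*||Ax - b||^2
# (grad f Lipschitz with constant L = ||A||_2^2) and g(x) = lam*||x||_1.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 8))
b = rng.standard_normal(20)
lam = 0.3
L = np.linalg.norm(A, 2) ** 2
gamma = 0.5 / L                                 # step size gamma < 1/L
kappa1 = 1.0 / (2.0 * gamma) - L / 2.0          # kappa1 = 1/(2*gamma) - L/2 > 0

def F(x):
    return 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()

def pg_step(x):
    v = x - gamma * A.T @ (A @ x - b)
    return np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)

x = rng.standard_normal(8)
slacks = []  # F(x^k) - kappa1*||x^{k+1}-x^k||^2 - F(x^{k+1}), should be >= 0
for _ in range(50):
    x_new = pg_step(x)
    slacks.append(F(x) - kappa1 * np.linalg.norm(x_new - x) ** 2 - F(x_new))
    x = x_new
print(min(slacks))  # nonnegative up to rounding: the descent inequality holds
```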
(2) By the optimality of \(x^{k+1}\) we have that for any x,
which can be reformulated as
By the Lipschitz continuity of ∇f(x),
By the above two inequalities we obtain
from which we can obtain (3.2) with \(\kappa _{2} := \max \limits \left \{ \left (\frac {1}{\gamma } + \frac {L+1}{2} \right ), \left (\frac {L}{2} + \frac {1}{2\gamma } \right ) \right \}\).
A.2 Proof of Theorem 5
In the proof, we write \(\zeta :=F(\bar x)\) for succinctness. Recall that the proper separation of the stationary value condition holds at \(\bar {x} \in {\mathcal {X}}^{\pi }\), i.e., there exists δ > 0 such that
Without loss of generality, we assume that 𝜖 < δ/(κ + 1) throughout the proof.
Step 1. We prove that \(\bar x\) is a stationary point and
Adding the inequalities in (3.1) starting from iteration k = 0 to an arbitrary positive integer K, we obtain
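(The omitted display is the telescoped bound; a hedged reconstruction, assuming (3.1) takes the sufficient-descent form \(F(x^{k+1}) \le F(x^{k}) - \kappa_{1} \|x^{k+1}-x^{k}\|^{2}\) and that \(F\) is bounded below:

```latex
\kappa_{1} \sum_{k=0}^{K} \left\| x^{k+1} - x^{k} \right\|^{2}
  \;\le\; \sum_{k=0}^{K} \bigl( F(x^{k}) - F(x^{k+1}) \bigr)
  \;=\; F(x^{0}) - F(x^{K+1})
  \;\le\; F(x^{0}) - \inf F .
```

Letting \(K \to \infty\) gives the summability claim that follows.)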
It follows that \({\sum }_{k=0}^{\infty } \left \| x^{k+1} - x^{k} \right \|^{2} < \infty ,\) and consequently (A.3) holds. Let \(\{ x^{k_{i}} \}_{i=1}^{\infty }\) be a convergent subsequence of \(\left \{ x^{k} \right \}\) such that \(x^{k_{i}}\rightarrow \bar {x}\) as \(i\rightarrow \infty \). Then by (A.3), we have
Since
let \(i\rightarrow \infty \) in (A.5) and by the outer semicontinuity of \(\text {Prox}_{g}^{\gamma } (\cdot )\) (see [50, Theorem 1.25]) and continuity of ∇f, we have
Using the definition of the proximal operator and applying the optimality condition, we have
and so \(\bar x\in \mathcal {X}^{\pi }\).
Step 2. Given \(\hat {\epsilon } > 0\) such that \(\hat {\epsilon } < \delta /\epsilon - \kappa -1\), for each k > 0, we can find \({{\bar {x}^{k} \in \mathcal {X}^{\pi }}}\) such that
It follows from the cost-to-go estimate condition (3.2) that
with \(\hat { \kappa }_{2} = \kappa _{2}(1+\hat {\epsilon })\). We now use mathematical induction to prove that there exists kℓ > 0 such that for all j ≥ kℓ,
where the constant \(c:=\frac {2\sqrt {\hat { \kappa }_{2}(\kappa ^{2}+1)}}{\kappa _{1}}> 0\).
By (A.4) and the fact that F is continuous in its domain, there exists kℓ > 0 such that \(x^{k_{\ell }}\in \mathbb {B} \left (\bar {x},\epsilon \right )\), \(x^{k_{\ell }+1}\in \mathbb {B} \left (\bar {x},\epsilon \right )\),
which indicates \(\bar {x}^{k_{\ell }}\in {{{\mathcal {X}}^{\pi }}} \cap {\mathbb {B}}\left (\bar {x},\delta \right )\). It follows by the proper separation of the stationary value condition (A.2) that \(F\left (\bar {x}^{k_{\ell }} \right ) = \zeta \).
Before carrying out the induction for (A.7)–(A.9), we first show that, for j ≥ kℓ, if (A.7) and (A.8) hold, then
Firstly, since \(x^{j}\in \mathbb {B} \left (\bar {x}, \epsilon \right )\), \(F(\bar {x}^{j}) = \zeta \) and (A.6) holds, it follows from (3.4) that
where \(\kappa _{3} := \sqrt {\hat { \kappa }_{2} \left (\kappa ^{2} + 1 \right )}\). Similarly, since \(x^{j+1}\in \mathbb {B} \left (\bar {x}, \epsilon \right )\) and \(F(\bar {x}^{j+1}) = \zeta \), by (A.6) and condition (3.4), we have
As a result, we can obtain
Since \(c = \frac {2\kappa _{3}}{\kappa _{1}}\), we have
from which by applying \(ab\le \left (\frac {a+b}{2} \right )^{2}\) we establish (A.12).
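(For completeness, the elementary inequality invoked here is a consequence of the AM–GM inequality:

```latex
ab \le \left( \frac{a+b}{2} \right)^{2} \quad \text{for all } a, b \ge 0,
\qquad \text{since } 0 \le (a-b)^{2} = (a+b)^{2} - 4ab .
```

It is applied with \(a, b\) the two step-length factors in the preceding display.)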
Next we proceed to prove the three properties (A.7)–(A.9) by induction on j. For j = kℓ, we have
and similar to the estimate of \(\left \|\bar {x}^{k_{\ell }} - \bar {x}\right \|\), we can show
It follows by (A.2) that \(F(\bar {x}^{k_{\ell }+1}) = \zeta \), and hence by (A.6),
which is (A.8) with j = kℓ. Note that property (A.9) for j = kℓ can be obtained directly through (A.12).
Now suppose that (A.7), (A.8), and (A.9) hold for some j ≥ kℓ. We show that they also hold for j + 1. We have
where the second inequality follows from (A.9) and (A.11) and the last inequality follows from (A.10). Since \(x^{j+2}\in \mathbb {B}(\bar x, \epsilon )\), by the definition of \(\bar {x}^{j}\) and (3.4), there holds that
where the third inequality follows from (A.11). It follows from the proper separation of stationary value assumption (A.2) that \(F(\bar {x}^{j+2}) = \zeta \). Consequently by (A.6), we have
So far we have shown that (A.7) and (A.8) hold for j + 1. Moreover,
from which we obtain (A.9) for j + 1. The desired induction on j is now complete. In summary, we have now proved the properties (A.7)–(A.9).
Step 3. We prove that the whole sequence \(\{x^{k}\}\) converges to \(\bar x\) and that (3.5)–(3.6) hold.
By (A.9), for all j ≥ kℓ
which indicates that \(\left \{ x^{k} \right \}\) is a Cauchy sequence. It follows that the whole sequence converges to the stationary point \(\bar {x}\). Furthermore, for all k ≥ kℓ we have \(x^{k}\in \mathbb {B}(\bar {x},\epsilon )\). As a result, the PG-iteration-based error bound condition (3.4) holds at all the iterates \(\left \{ x^{k} \right \}_{k > k_{\ell }}\). Recall that by (3.1) and (A.14), we have
which implies that
It is easy to observe that
Thus we have
which completes the proof of (3.5).
Inspired by [6], we have the following linear convergence result for the sequence \(\{x^{k}\}\). Recall the sufficient descent property (3.1),
which indicates that there exists a constant C such that
In addition, we have that
which implies (3.6) with \(\rho _{0} = \frac {C}{1-\sqrt {\sigma }}\).
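(The final implication is the standard geometric-tail estimate; a sketch, assuming the omitted display bounds the step lengths as \(\|x^{j+1}-x^{j}\| \le C \sigma^{j/2}\), which is consistent with the constant \(\rho_{0}\) above:

```latex
\|x^{k} - \bar{x}\|
  \;\le\; \sum_{j=k}^{\infty} \left\| x^{j+1} - x^{j} \right\|
  \;\le\; C \sum_{j=k}^{\infty} \bigl( \sqrt{\sigma} \bigr)^{j}
  \;=\; \frac{C}{1-\sqrt{\sigma}} \, \sigma^{k/2} .
```

)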
Cite this article
Wang, X., Ye, J.J., Yuan, X. et al. Perturbation Techniques for Convergence Analysis of Proximal Gradient Method and Other First-Order Algorithms via Variational Analysis. Set-Valued Var. Anal 30, 39–79 (2022). https://doi.org/10.1007/s11228-020-00570-0
Keywords
- Calmness
- Error bound
- Proximal gradient method
- Linear convergence
- Variational analysis
- Alternating direction method of multipliers