1 Introduction

Given a real symmetric, possibly indefinite, matrix H and vector g, we are concerned with Krylov methods for approximating the global solution of the possibly nonconvex regularization problem

$$\begin{aligned} {\min _{x \in \mathfrak {R}^n} \; Q(x;\sigma ,p) :={\scriptstyle \frac{1}{2}}x^T H x + g^T x + \frac{\sigma }{p} \Vert x\Vert ^p,} \end{aligned}$$
(1.1)

where \(\sigma > 0\), \(p > 2\) and \(\Vert \cdot \Vert \) is the Euclidean norm—note that Q is bounded below over \(\mathfrak {R}^n\), and all global minimizers have the same norm [7, §3]. Such methods have been advocated by a number of authors, e.g., [1,2,3, 8]. Here we are interested in how the norms of the estimates of the solution evolve as the Krylov process proceeds. The main utility is that these estimates provide useful predictions for the “multipliers” \(\sigma \Vert x\Vert ^{p-2}\) as the Krylov subspace expands [9]. Our result is an analogue of that obtained by Lukšan, Matonoha and Vlček [11] for the trust-region subproblem.

By way of motivation and explanation, the solution \(x_*\) to (1.1) necessarily satisfies the first-order criticality condition \(\nabla _x Q(x_*;\sigma ,p) = 0\), i.e.,

$$\begin{aligned} {(H + \mu _* I) x_* = - g, \;\; \text{ where } \;\; \mu _* = \sigma \Vert x_*\Vert ^{p-2}.} \end{aligned}$$
(1.2)

In addition, \(H + \mu _* I\) is positive semi-definite at any global minimizer, and if \(H + \mu _* I\) is positive definite, the minimizer is unique [3, Thm.3.1]. If \(\mu _*\) were known, the minimizer might be found simply by solving the linear system (1.2), and the skill is then in finding convergent estimates of \(\mu _*\) by iteration [5]. Briefly, this is achieved by seeking the rightmost root of the secular equation

$$\begin{aligned} \Vert x( \mu )\Vert ^{p-2} - \frac{\mu }{\sigma } = 0, \end{aligned}$$

where

$$\begin{aligned} {( H + \mu I ) x( \mu ) = - g} \end{aligned}$$
(1.3)

while ensuring that \( H + \mu I\) is positive semi-definite. This is always possible so long as g has a nonzero component in the subspace \(\mathcal{U}\) of eigenvectors of H corresponding to the leftmost eigenvalue \(\lambda _{\min }(H)\) of H, and in this case \(H + \mu _* I\) is positive definite. The rare possibility that the latter does not happen is known colloquially as the “hard case” [12], and the solution to (1.1) in the hard case involves an additional component from \(\mathcal{U}\). We also make the connection between (1.1) and the regularized quadratic

$$\begin{aligned} {Q_{\nu }(x) :={\scriptstyle \frac{1}{2}}x^T ( H + \nu I ) x + g^T x,} \end{aligned}$$
(1.4)

namely that if \(\nu = \mu _*\) and the hard case does not occur then \(x_*\) minimizes \(Q_{\nu }(x)\).
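
As an informal illustration of this root-finding mechanism (it is not the algorithm of [5], and the function and variable names are ours), the sketch below applies a crudely safeguarded Newton iteration to the secular equation for a small dense H, factorizing \(H + \mu I\) afresh at each step.

```python
import numpy as np

def solve_regularized(H, g, sigma, p, tol=1e-10, max_iter=100):
    """Illustrative sketch only: locate the rightmost root of
    ||x(mu)||^{p-2} - mu/sigma = 0 with (H + mu I) x(mu) = -g,
    assuming g is not orthogonal to the lambda_min(H) eigenspace (no hard case)."""
    n = len(g)
    mu_lo = max(0.0, -np.linalg.eigvalsh(H)[0])  # keep H + mu I positive semi-definite
    mu = mu_lo + 1.0                             # naive starting guess right of -lambda_min(H)
    for _ in range(max_iter):
        x = np.linalg.solve(H + mu * np.eye(n), -g)
        phi = np.linalg.norm(x) ** (p - 2) - mu / sigma      # secular function
        if abs(phi) <= tol:
            break
        w = np.linalg.solve(H + mu * np.eye(n), x)           # used in the derivative of phi
        dphi = -(p - 2) * np.linalg.norm(x) ** (p - 4) * (x @ w) - 1.0 / sigma
        mu = max(mu - phi / dphi, 0.5 * (mu + mu_lo))        # Newton step, crudely safeguarded
    return x, mu
```

At the computed root, \(\mu = \sigma \Vert x\Vert ^{p-2}\) and x satisfies the first-order condition (1.2).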

This all presupposes that one can solve the linear system (1.3); unfortunately, in some applications matrix factorization is out of the question, and indeed H may only be available indirectly via Hessian-vector products Hv for given v. An attractive alternative in such cases is to find an approximation to the solution of (1.1) by restricting the search domain to a subspace or sequence of subspaces. A particularly appealing set of nested subspaces is provided by the Krylov spaces generated by H and g, introduced formally in the next section. Crucially, the k-th Krylov subspace, \(\mathcal{K}_k(H,g)\), may be generated recursively through Hessian-vector products, and has an orthonormal basis \(V_k\), with the desirable property that if

$$\begin{aligned} {x_k = \arg \min _{x \in \mathcal{K}_k(H,g)} Q(x;\sigma ,p),} \end{aligned}$$
then \(x_k = V_k \bar{x}_k\), where

$$\begin{aligned} {\bar{x}_k = \arg \min _{\bar{x} \in \mathfrak {R}^k} \; {\scriptstyle \frac{1}{2}}\bar{x}^T T_k^{} \bar{x} + \Vert g\Vert \, e_1^T \bar{x} + \frac{\sigma }{p} \Vert \bar{x}\Vert ^p,} \end{aligned}$$
(1.5)

the vector \(e_1 \in \mathfrak {R}^k\) is the first column of the identity matrix and \(T_k^{} = V_k^T H V_k^{} \in \mathfrak {R}^{k \times k}\) is tridiagonal. The latter implies that solving (1.5) is feasible via its optimality equations

$$\begin{aligned} { (T_k + \mu _k I) \bar{x}_k = - \Vert g\Vert e_1, \;\; \text{ where } \;\; \mu _k = \sigma \Vert x_k\Vert ^{p-2},} \end{aligned}$$

even when the dimension is large, since factorizing shifted tridiagonal matrices and solving linear systems involving them may be achieved in a few multiples of k floating-point operations.
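
To make the preceding paragraph concrete, here is a minimal sketch, in our own notation and with no reorthogonalization or breakdown handling, of the Lanczos recursion that builds \(V_k\) and \(T_k\) from Hessian-vector products alone, together with the cheap shifted tridiagonal solve used in the optimality equations above (assuming \(T_k + \mu I \succ 0\)).

```python
import numpy as np
from scipy.linalg import solveh_banded

def lanczos(Hv, g, k):
    """Sketch of k steps of plain Lanczos: returns an orthonormal basis V of K_k(H, g) and
    the diagonal (alpha) and off-diagonal (beta) of T_k = V^T H V, using only products Hv(v)."""
    n = g.shape[0]
    V = np.zeros((n, k))
    alpha, beta = np.zeros(k), np.zeros(k - 1)
    V[:, 0] = g / np.linalg.norm(g)
    for j in range(k):
        w = Hv(V[:, j])
        if j > 0:
            w -= beta[j - 1] * V[:, j - 1]
        alpha[j] = V[:, j] @ w
        w -= alpha[j] * V[:, j]
        if j < k - 1:
            beta[j] = np.linalg.norm(w)
            V[:, j + 1] = w / beta[j]
    return V, alpha, beta

def solve_shifted_tridiagonal(alpha, beta, mu, rhs):
    """Solve (T_k + mu I) x = rhs in O(k) work, assuming T_k + mu I is positive definite."""
    k = alpha.shape[0]
    ab = np.zeros((2, k))
    ab[0, 1:] = beta          # super-diagonal, in LAPACK upper symmetric-banded storage
    ab[1, :] = alpha + mu     # shifted main diagonal
    return solveh_banded(ab, rhs)
```

With these pieces, the optimality system \((T_k + \mu _k I)\bar{x}_k = -\Vert g\Vert e_1\) amounts to a single call of the second routine with right-hand side \(-\Vert g\Vert e_1\).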

Of course, we need to judge when \(x_k\) is a meaningful approximation to \(x_*\) as the subspaces evolve, and furthermore we need to solve each successive subproblem (1.5) efficiently. The former is addressed in [2, 6] and requires estimates of \(\mu _k\), while the latter appeals to the ideas in [5] and relies on a good starting “guess” for \(\mu _k\). Thus generating a good starting guess provides the motivation for this short paper.

In the next section we provide a set of lemmas leading to our main result, namely that so long as the evolving Krylov subspaces are of full dimensionality, the norms of the solution estimates \(\Vert x_k\Vert \) and the corresponding “multipliers” \(\mu _k\) increase monotonically. We summarize, extend and discuss implications and limitations of our results in the concluding Sect. 3.

2 The main result

We start with four vital lemmas that we use to prove our main result. The first shows a simple property of the conjugate gradient method. We use the generic notation \(H \succ 0\) (resp. \(H \succeq 0\)) to mean that the real, symmetric matrix H is positive definite (resp. positive semi-definite).

Lemma 1

Given a real symmetric matrix H and real vector g, let

$$\begin{aligned} {\mathcal{K}_k(H,g) :=\text{ span }\{ g, Hg, H^2 g, \ldots , H^{k-1} g \},} \end{aligned}$$

\(k \ge 1\), be the k-th Krylov subspace generated by H and the vector g, and let the columns of \(V_k\) provide an orthonormal basis for \(\mathcal{K}_k(H,g)\). Letting \(\ell \ge k \ge 1\), suppose that

$$\begin{aligned} {V_{\ell }^T H V_{\ell }^{} \succ 0} \end{aligned}$$
(2.1)

and define

$$\begin{aligned} {x_j :=\arg \min _{x \in \mathcal{K}_j(H,g)} \, {\scriptstyle \frac{1}{2}}x^T H x + g^T x \;\; \text{ for } \;\; j \in \{k, \ell \}.} \end{aligned}$$

Then

$$\begin{aligned} {\Vert x_k \Vert \le \Vert x_{\ell } \Vert .} \end{aligned}$$

Proof

This follows from [4, Thm.7.5.1] as the requirement there, namely that \(p_k^T H p_k^{} > 0\) for specific vectors \(p_k \in \mathcal{K}_k(H,g)\), is implied by the more general assumption (2.1). \(\square \)

Note that this is a generalization of [13, Thm.2.1] that relaxes the requirement that H be positive definite on all of \(\mathfrak {R}^n\) to positive definiteness merely over the evolving Krylov subspaces of interest.
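
As an informal numerical check of Lemma 1 (purely illustrative; the construction below is ours), one can minimize the quadratic \({\scriptstyle \frac{1}{2}}x^T H x + g^T x\) over nested Krylov subspaces of an indefinite H and verify that the norms of the minimizers are nondecreasing while the projected matrices remain positive definite.

```python
import numpy as np

rng = np.random.default_rng(0)
n, kmax = 50, 25
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
H = Q @ np.diag(np.linspace(-1.0, 10.0, n)) @ Q.T      # symmetric and indefinite
g = rng.standard_normal(n)

# Orthonormal bases of the nested Krylov subspaces, by Gram-Schmidt on g, Hg, H^2 g, ...
V = np.zeros((n, kmax))
V[:, 0] = g / np.linalg.norm(g)
for j in range(1, kmax):
    w = H @ V[:, j - 1]
    w -= V[:, :j] @ (V[:, :j].T @ w)
    V[:, j] = w / np.linalg.norm(w)

norms = []
for k in range(1, kmax + 1):
    Tk = V[:, :k].T @ H @ V[:, :k]
    if np.linalg.eigvalsh(Tk)[0] <= 0:                 # stop once assumption (2.1) fails
        break
    xk = V[:, :k] @ np.linalg.solve(Tk, -V[:, :k].T @ g)
    norms.append(np.linalg.norm(xk))

assert all(a <= b + 1e-8 for a, b in zip(norms, norms[1:]))   # nondecreasing, as Lemma 1 asserts
```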

Our second lemma compares Krylov subspaces of the matrices H and \(H +\mu I\) for some \(\mu \in \mathfrak {R}\).

Lemma 2

[11, Lem.2.3]. Let H, g and \(\mathcal{K}_k\) be as in Lemma 1, and \(\mu \in \mathfrak {R}\). Then

$$\begin{aligned} {\mathcal{K}_k (H+\mu I, g ) = \mathcal{K}_k(H,g).} \end{aligned}$$
(2.2)

Next, we state a crucial relation between the parameter \(\nu \) that defines \(Q_{\nu }(x)\) in (1.4) and the norm of the minimizer of \(Q_{\nu }(x)\) within the k-th Krylov space.

Lemma 3

[11, Lem.2.5]. Suppose that the columns of \(V_k\) provide an orthonormal basis for \(\mathcal{K}_k(H,g)\) for given real symmetric H and real g. Let \(V_k^T H V_k^{} + \nu _i^{} I\), \(\nu _i \in \mathfrak {R}\), \(i \in \{1, 2\}\), be symmetric and positive definite. Let

$$\begin{aligned} {x_k(\nu _i) :=\arg \min _{x \in \mathcal{K}_k(H,g)} Q_{\nu _i}(x), \;\; i \in \{1, 2\}.} \end{aligned}$$
Then

$$\begin{aligned} {\nu _2 \le \nu _1 \;\; \text{ if } \text{ and } \text{ only } \text{ if } \;\; \Vert x_k (\nu _2 )\Vert \ge \Vert x_k (\nu _1 )\Vert .} \end{aligned}$$
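
For intuition, and only as an informal aside (the differentiation argument below is ours), write \(x_k(\nu ) = V_k \bar{x}_k(\nu )\) with \(( V_k^T H V_k^{} + \nu I ) \bar{x}_k(\nu ) = - V_k^T g\); differentiating this system with respect to \(\nu \) and using the orthonormality of the columns of \(V_k\) gives

$$\begin{aligned} {\frac{d}{d\nu }\Vert x_k(\nu )\Vert ^2 = \frac{d}{d\nu }\Vert \bar{x}_k(\nu )\Vert ^2 = -2\, \bar{x}_k(\nu )^T ( V_k^T H V_k^{} + \nu I )^{-1} \bar{x}_k(\nu ) \le 0,} \end{aligned}$$

so \(\Vert x_k(\nu )\Vert \) is nonincreasing on the interval of \(\nu \) for which \(V_k^T H V_k^{} + \nu I \succ 0\), consistent with the equivalence stated in the lemma.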

We define the grade of H and g, \(\text{ grade }(H,g) \le n\), to be the maximum dimension of the evolving Krylov spaces \(\mathcal{K}_k(H,g)\), \(k = 1, \ldots , n\) [10]. Our final lemma indicates that the evolving minimizers are unique.

Lemma 4

Let H, g and \(V_k\) be as in Lemma 3 and let \(\mu _k\) be the rightmost root of the secular equation

$$\begin{aligned} { \Vert \bar{x}_k^{}(\mu )\Vert ^{p-2} - \frac{\mu }{\sigma } = 0, \;\; \text{ where } \;\; ( V_k^T H V_k^{} + \mu I ) \bar{x}_k^{}( \mu ) = - V_k^T g.} \end{aligned}$$
(2.3)

Then \(V_k^T H V_k^{} + \mu _k^{} I \succ 0\) for all \(1 \le k \le m :=\text{ grade }(H,g)\).

Proof

Using the Lanczos orthonormal basis, we have that \(V_k^T H V_k^{} = T_k^{}\) for an irreducible tridiagonal matrix \(T_k\) for \(k = 1, \ldots , m\). It then follows [4, Thm.7.5.12] that the projected gradient \(V_k^T g = \Vert g\Vert e_1\) has a nonzero component along the eigenspace of \(T_k\) corresponding to the eigenvalue \(\lambda _{\min }(T_k)\) (i.e., the “hard case” cannot occur for the reduced problem), and thus that the only permitted root \(\mu _k\) of the secular Eq. (2.3) satisfies \(\mu _k > - \lambda _{\min }(T_k)\), where \(\lambda _{\min }\) denotes the leftmost eigenvalue of its symmetric matrix argument [5, Sec.2.2]. \(\square \)
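
The structural fact behind the citation of [4, Thm.7.5.12] is that an eigenvector u of an irreducible tridiagonal matrix cannot have a zero first component. As a brief sketch of the standard argument (in our notation, with \(\alpha _j\) and \(\beta _j \ne 0\) denoting the diagonal and off-diagonal entries of \(T_k\)): if \(T_k u = \lambda u\) with \(u_1 = 0\), then reading the eigen-equation row by row,

$$\begin{aligned} {\alpha _1 u_1 + \beta _1 u_2 = \lambda u_1 \;\Rightarrow \; u_2 = 0 \;\; \text{ and } \;\; \beta _{j-1} u_{j-1} + \alpha _j u_j + \beta _j u_{j+1} = \lambda u_j \;\Rightarrow \; u_{j+1} = 0} \end{aligned}$$

for \(j = 2, \ldots , k-1\), forcing \(u = 0\), a contradiction. Consequently \(V_k^T g = \Vert g\Vert e_1\) has a nonzero component along every eigenvector of \(T_k\), and in particular along those associated with \(\lambda _{\min }(T_k)\).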

We are now in a position to state and prove our main theorem.

Theorem 1

Given a real symmetric matrix H, vector g and scalars \(\sigma > 0\) and \(p > 2\), let \(m = \text{ grade }(H,g)\),

$$\begin{aligned} {x_j :=\arg \min _{x \in \mathcal{K}_j(H,g)} Q(x;\sigma ,p)} \end{aligned}$$
and

$$\begin{aligned} {\mu _j = \sigma \Vert x_j\Vert ^{p-2}} \end{aligned}$$
(2.4)

for \(j \ge 1\). Then \(\mu _k \le \mu _{\ell }\) and \(\Vert x_k\Vert \le \Vert x_{\ell }\Vert \) for \(1 \le k \le \ell \le m\).

Proof

Let \(V_j\) be as in the statement of Lemma 3. The vector \(x_j = V_j y_j\) is a minimizer of the j-th regularization subproblem if and only if

$$\begin{aligned} {V_j^T (H + \mu _j I) V_j y_j = -V_j^T g, \;\; V_j^T (H + \mu _j I) V_j \succeq 0, \;\; \text{ and } \;\; \mu _j = \sigma \Vert y_j\Vert ^{p-2},} \end{aligned}$$
(2.5)

and the minimizer is unique since \(V_j^T (H + \mu _j I) V_j \succ 0\) from Lemma 4 [5, Thm.2]. Consider two integers k and \(\ell \) for which \(1 \le k \le \ell \le m\).

Since we have \(V_k^T (H + \mu _k I)V_k \succ 0\) and \(V_{\ell }^T (H + \mu _{\ell } I)V_{\ell } \succ 0\), and as \(\mathcal{K}_k (H+\mu _kI,g) = \mathcal{K}_k(H,g)\) by Lemma 2, it follows from (2.5) that \(x_k\) is also the (unique) solution of the subspace minimization problem

$$\begin{aligned} {\min _{x \in \mathcal{K}_k(H,g)} Q_{\mu _k}(x).} \end{aligned}$$
Assume that \(\mu _k > \mu _{\ell }\), which implies that \(V_{\ell }^T (H + \mu _k I)V_{\ell } \succ 0\). Let

$$\begin{aligned} {x_{\ell }(\mu _k) :=\arg \min _{x \in \mathcal{K}_{\ell }(H,g)} Q_{\mu _k}(x).} \end{aligned}$$
Then it follows from Lemma 1 that

$$\begin{aligned} {\Vert x_k \Vert \le \Vert x_{\ell }(\mu _k) \Vert .} \end{aligned}$$
(2.6)

But since \(\mu _{\ell } < \mu _k\), Lemma 3 gives that

$$\begin{aligned} {\Vert x_{\ell } (\mu _k )\Vert \le \Vert x_{\ell } (\mu _{\ell } )\Vert = \Vert x_{\ell }\Vert .} \end{aligned}$$
(2.7)

Hence using the definition (2.4) and combining the inequalities (2.6) and (2.7)

$$\begin{aligned} {\mu _k = \sigma \Vert x_k\Vert ^{p-2} \le \sigma \Vert x_{\ell }\Vert ^{p-2} = \mu _{\ell } < \mu _k} \end{aligned}$$

which is a contradiction. Thus \(\mu _k \le \mu _{\ell }\) has to hold. It then follows from the definition (2.4) that \(\Vert x_k\Vert \le \Vert x_{\ell }\Vert \). \(\square \)
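
As an informal numerical illustration of Theorem 1 (not part of the analysis; the construction and names below are ours), the following sketch computes the multipliers \(\mu _k\) for an indefinite H by bracketing the rightmost root of the reduced secular Eq. (2.3) on each Krylov subspace, and checks that both \(\mu _k\) and \(\Vert x_k\Vert \) are nondecreasing.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
n, kmax, sigma, p = 40, 15, 0.1, 3.0
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
H = Q @ np.diag(np.linspace(-2.0, 8.0, n)) @ Q.T       # symmetric and indefinite
g = rng.standard_normal(n)

# Orthonormal bases of the nested Krylov subspaces K_k(H, g)
V = np.zeros((n, kmax))
V[:, 0] = g / np.linalg.norm(g)
for j in range(1, kmax):
    w = H @ V[:, j - 1]
    w -= V[:, :j] @ (V[:, :j].T @ w)
    V[:, j] = w / np.linalg.norm(w)

mus, xnorms = [], []
for k in range(1, kmax + 1):
    Tk = V[:, :k].T @ H @ V[:, :k]
    gk = V[:, :k].T @ g

    def secular(mu):               # secular function of (2.3) on the k-th subspace
        xbar = np.linalg.solve(Tk + mu * np.eye(k), -gk)
        return np.linalg.norm(xbar) ** (p - 2) - mu / sigma

    lo = max(0.0, -np.linalg.eigvalsh(Tk)[0]) + 1e-10
    hi = lo + 1.0
    while secular(hi) > 0:         # expand until the secular function changes sign
        hi *= 2.0
    mu_k = brentq(secular, lo, hi) # rightmost root
    mus.append(mu_k)
    xnorms.append((mu_k / sigma) ** (1.0 / (p - 2)))   # ||x_k|| recovered from mu_k

assert all(a <= b + 1e-8 for a, b in zip(mus, mus[1:]))        # Theorem 1: mu_k nondecreasing
assert all(a <= b + 1e-8 for a, b in zip(xnorms, xnorms[1:]))  # and ||x_k|| nondecreasing
```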

The monotonic behaviour of the multipliers \(\mu _k\) was predicted in [9, Lem.2.6] when \(p=3\), but the proof suggested there relied on [11, Thm.2.6], which appears to have a minor flaw—the proof depends on [13, Thm.2.1], but applies this at one point to an indefinite \(H+\mu I\). Lemma 1 avoids this issue, and the same result fixes the proof of [11, Thm.2.6] that applies in the trust-region case.

3 Comments and conclusions

We have shown that the norms of the approximations generated by well-known Krylov methods for solving the regularization problem (1.1) increase monotonically as the dimension of the Krylov spaces expands. This implies that the corresponding “multipliers” \(\mu _k\) also increase, and is useful as estimates of these multipliers are crucial when solving the Krylov subproblem; in particular, as the multiplier for the k-th problem is a lower bound for the \((k+1)\)-st, Newton-like iterations for the required root of the secular equation

$$\begin{aligned} {\Vert \bar{x}_{k+1}( \mu )\Vert ^{p-2} - \frac{\mu }{\sigma } = 0, \;\; \text{ where } \;\; ( V_{k+1}^T H V_{k+1}^{} + \mu I ) \bar{x}_{k+1}^{}( \mu ) = - V_{k+1}^T g, } \end{aligned}$$

will converge both globally and rapidly to \(\mu _{k+1}\) when started from \(\mu _{k}\) if additionally \(\mu _{k} > -\lambda _{\min }(T_{k+1})\) [5, §3]. In particular, Newton’s method, the secant method or methods based upon certain higher (odd)-order Taylor approximations or nonlinear rescalings of the term \(\Vert \bar{x}( \mu )\Vert ^{p-2}\) all converge monotonically from such a starting \(\mu \). Knowledge of the monotonic nature of these quantities is also important when deriving convergence bounds [6] for such methods.
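
For concreteness, one possible form of the Newton correction alluded to above, written in our notation for the secular function \(\phi _{k+1}(\mu ) :=\Vert \bar{x}_{k+1}(\mu )\Vert ^{p-2} - \mu /\sigma \), is

$$\begin{aligned} {\mu ^{+} = \mu - \frac{\phi _{k+1}(\mu )}{\phi _{k+1}'(\mu )}, \;\; \text{ where } \;\; \phi _{k+1}'(\mu ) = -(p-2)\, \Vert \bar{x}_{k+1}(\mu )\Vert ^{p-4}\, \bar{x}_{k+1}(\mu )^T ( V_{k+1}^T H V_{k+1}^{} + \mu I )^{-1} \bar{x}_{k+1}(\mu ) - \frac{1}{\sigma },} \end{aligned}$$

so each iteration needs only one additional solve with the shifted tridiagonal \(T_{k+1} + \mu I\), whose factorization is already available from computing \(\bar{x}_{k+1}(\mu )\).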

We warn readers that in exceptional circumstances, namely when g is orthogonal to the eigenspace corresponding to the leftmost eigenvalue of H and \(\sigma \) is not large enough, the global minimizer of (1.1) will not lie in \(\mathcal{K}_m(H,g)\), and \(\mu _m\) will underestimate the optimal multiplier. This (zero-probability) possibility is often referred to as the “hard case” ([3, §6.1], [12]) and, despite the popularity of Krylov methods, might be viewed as an unavoidable defect of this class of methods.

The main result here may trivially be extended for Krylov methods to

$$\begin{aligned} {\min _{x \in \mathfrak {R}^n} \; {\scriptstyle \frac{1}{2}}x^T H x + g^T x + \frac{\sigma }{p} \Vert x\Vert _M^p} \end{aligned}$$
for given symmetric \(M \succ 0\), where \(\Vert x\Vert _M^2 :=x^T M x\), so long as we instead consider the Krylov spaces \(\mathcal{K}(M^{-1} H, M^{-1} g)\). It is well known that this may be achieved using the M-preconditioned Lanczos method [3, Sec.6.3]. In particular, if

$$\begin{aligned} {x_j :=\arg \min _{x \in \mathcal{K}_j(M^{-1}H, M^{-1}g)} {\scriptstyle \frac{1}{2}}x^T H x + g^T x + \frac{\sigma }{p} \Vert x\Vert _M^p \;\; \text{ and } \;\; \mu _j = \sigma \Vert x_j\Vert _M^{p-2},} \end{aligned}$$
it follows (using the transformation \(x \leftarrow M^{{\scriptstyle \frac{1}{2}}} x\)) that

$$\begin{aligned} {\mu _k \le \mu _{\ell } \;\; \text{ and } \;\; \Vert x_k\Vert _M \le \Vert x_{\ell }\Vert _M} \end{aligned}$$

for \(1 \le k \le \ell \le m\) just as in Theorem 1.
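
To see why the transformation works, note, as a brief sketch in our notation, that substituting \(y = M^{{\scriptstyle \frac{1}{2}}} x\) turns the M-norm problem above into a Euclidean-norm problem of the form (1.1),

$$\begin{aligned} {\min _{y \in \mathfrak {R}^n} \; {\scriptstyle \frac{1}{2}}y^T ( M^{-{\scriptstyle \frac{1}{2}}} H M^{-{\scriptstyle \frac{1}{2}}} ) y + ( M^{-{\scriptstyle \frac{1}{2}}} g )^T y + \frac{\sigma }{p} \Vert y\Vert ^p,} \end{aligned}$$

and that \(\mathcal{K}_k(M^{-{\scriptstyle \frac{1}{2}}} H M^{-{\scriptstyle \frac{1}{2}}}, M^{-{\scriptstyle \frac{1}{2}}} g) = M^{{\scriptstyle \frac{1}{2}}} \mathcal{K}_k(M^{-1} H, M^{-1} g)\), so that Theorem 1 applied to the transformed problem delivers the stated M-norm monotonicity.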