1 Introduction

Given a real symmetric, possibly indefinite, matrix H and vector g, we are concerned with Krylov methods for approximating the global solution of the possibly nonconvex regularization problem

$$\begin{aligned} {\min _{x \in \mathfrak {R}^n} \; Q(x;\sigma ,p) :={\scriptstyle \frac{1}{2}}x^T H x + g^T x + \frac{\sigma }{p} \Vert x\Vert ^p,} \end{aligned}$$
(1.1)

where \(\sigma > 0\), \(p > 2\) and \(\Vert \cdot \Vert \) is the Euclidean norm—note that Q is bounded below over \(\mathfrak {R}^n\), and all global minimizers have the same norm [7, §3]. Such methods have been advocated by a number of authors, e.g., [1,2,3, 8]. Here we are interested in how the norms of the estimates of the solution evolve as the Krylov process proceeds. The main utility is that these estimates provide useful predictions for the “multipliers” \(\sigma \Vert x\Vert ^{p-2}\) as the Krylov subspace expands [9]. Our result is an analogue of that obtained by Lukšan, Matonoha and Vlček [11] for the trust-region subproblem.

By way of motivation and explanation, the solution \(x_*\) to (1.1) necessarily satisfies the first-order criticality condition \(\nabla _x Q(x_*;\sigma ,p) = 0\), i.e.,

$$\begin{aligned} {(H + \mu _* I) x_* = - g, \;\; \text{ where } \;\; \mu _* = \sigma \Vert x_*\Vert ^{p-2}.} \end{aligned}$$
(1.2)

In addition, \(H + \mu _* I\) is positive semi-definite at any global minimizer, and if \(H + \mu _* I\) is positive definite, the minimizer is unique [3, Thm.3.1]. If \(\mu _*\) were known, the minimizer might be found simply by solving the linear system (1.2), and the skill is then in finding convergent estimates of \(\mu _*\) by iteration [5]. Briefly, this is achieved by seeking the rightmost root of the secular equation

$$\begin{aligned} \Vert x( \mu )\Vert ^{p-2} - \frac{\mu }{\sigma } = 0, \end{aligned}$$

where

$$\begin{aligned} {( H + \mu I ) x( \mu ) = - g} \end{aligned}$$
(1.3)

while ensuring that \( H + \mu I\) is positive semi-definite. This is always possible so long as g has a nonzero component in the subspace \(\mathcal{U}\) of eigenvectors of H corresponding to the leftmost eigenvalue \(\lambda _{\min }(H)\) of H, and in this case \(H + \mu _* I\) is positive definite. The rare possibility that the latter does not happen is known colloquially as the “hard case” [12], and the solution to (1.1) in the hard case involves an additional component from \(\mathcal{U}\). We also make the connection between (1.1) and the regularized quadratic

$$\begin{aligned} {Q_{\nu }(x) :={\scriptstyle \frac{1}{2}}x^T ( H + \nu I ) x + g^T x,} \end{aligned}$$
(1.4)

namely that if \(\nu = \mu _*\) and the hard case does not occur then \(x_*\) minimizes \(Q_{\nu }(x)\).
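
As an informal illustration of this root-finding mechanism (it is not the algorithm of [5], and the function and variable names are ours), the sketch below applies a crudely safeguarded Newton iteration to the secular equation for a small dense H, factorizing \(H + \mu I\) afresh at each step.

```python
import numpy as np

def solve_regularized(H, g, sigma, p, tol=1e-10, max_iter=100):
    """Illustrative sketch only: locate the rightmost root of
    ||x(mu)||^{p-2} - mu/sigma = 0 with (H + mu I) x(mu) = -g,
    assuming g is not orthogonal to the lambda_min(H) eigenspace (no hard case)."""
    n = len(g)
    mu_lo = max(0.0, -np.linalg.eigvalsh(H)[0])  # keep H + mu I positive semi-definite
    mu = mu_lo + 1.0                             # naive starting guess right of -lambda_min(H)
    for _ in range(max_iter):
        x = np.linalg.solve(H + mu * np.eye(n), -g)
        phi = np.linalg.norm(x) ** (p - 2) - mu / sigma      # secular function
        if abs(phi) <= tol:
            break
        w = np.linalg.solve(H + mu * np.eye(n), x)           # used in the derivative of phi
        dphi = -(p - 2) * np.linalg.norm(x) ** (p - 4) * (x @ w) - 1.0 / sigma
        mu = max(mu - phi / dphi, 0.5 * (mu + mu_lo))        # Newton step, crudely safeguarded
    return x, mu
```

At the computed root, \(\mu = \sigma \Vert x\Vert ^{p-2}\) and x satisfies the first-order condition (1.2).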

This all presupposes that one can solve the linear system (1.3); unfortunately, in some applications matrix factorization is out of the question, and indeed H may only be available indirectly via Hessian-vector products Hv for given v. An attractive alternative in such cases is to find an approximation to the solution of (1.1) by restricting the search domain to a subspace or sequence of subspaces. A particularly appealing set of nested subspaces is provided by the Krylov spaces generated by H and g, introduced formally in the next section. Crucially, the k-th Krylov subspace, \(\mathcal{K}_k(H,g)\), may be generated recursively through Hessian-vector products, and has an orthonormal basis \(V_k\), with the desirable property that if

$$\begin{aligned} {x_k = \arg \min _{x \in \mathcal{K}_k(H,g)} Q(x;\sigma ,p),} \end{aligned}$$
then \(x_k = V_k \bar{x}_k\), where

$$\begin{aligned} {\bar{x}_k = \arg \min _{\bar{x} \in \mathfrak {R}^k} \; {\scriptstyle \frac{1}{2}}\bar{x}^T T_k^{} \bar{x} + \Vert g\Vert \, e_1^T \bar{x} + \frac{\sigma }{p} \Vert \bar{x}\Vert ^p,} \end{aligned}$$
(1.5)

the vector \(e_1 \in \mathfrak {R}^k\) is the first column of the identity matrix and \(T_k^{} = V_k^T H V_k^{} \in \mathfrak {R}^{k \times k}\) is tridiagonal. The latter implies that solving (1.5) is feasible via its optimality equations

$$\begin{aligned} { (T_k + \mu _k I) \bar{x}_k = - \Vert g\Vert e_1, \;\; \text{ where } \;\; \mu _k = \sigma \Vert x_k\Vert ^{p-2},} \end{aligned}$$

even when the dimension is large, since factorizing shifted tridiagonal matrices and solving linear systems involving them may be achieved in a few multiples of k floating-point operations.
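
To make the preceding paragraph concrete, here is a minimal sketch, in our own notation and with no reorthogonalization or breakdown handling, of the Lanczos recursion that builds \(V_k\) and \(T_k\) from Hessian-vector products alone, together with the cheap shifted tridiagonal solve used in the optimality equations above (assuming \(T_k + \mu I \succ 0\)).

```python
import numpy as np
from scipy.linalg import solveh_banded

def lanczos(Hv, g, k):
    """Sketch of k steps of plain Lanczos: returns an orthonormal basis V of K_k(H, g) and
    the diagonal (alpha) and off-diagonal (beta) of T_k = V^T H V, using only products Hv(v)."""
    n = g.shape[0]
    V = np.zeros((n, k))
    alpha, beta = np.zeros(k), np.zeros(k - 1)
    V[:, 0] = g / np.linalg.norm(g)
    for j in range(k):
        w = Hv(V[:, j])
        if j > 0:
            w -= beta[j - 1] * V[:, j - 1]
        alpha[j] = V[:, j] @ w
        w -= alpha[j] * V[:, j]
        if j < k - 1:
            beta[j] = np.linalg.norm(w)
            V[:, j + 1] = w / beta[j]
    return V, alpha, beta

def solve_shifted_tridiagonal(alpha, beta, mu, rhs):
    """Solve (T_k + mu I) x = rhs in O(k) work, assuming T_k + mu I is positive definite."""
    k = alpha.shape[0]
    ab = np.zeros((2, k))
    ab[0, 1:] = beta          # super-diagonal, in LAPACK upper symmetric-banded storage
    ab[1, :] = alpha + mu     # shifted main diagonal
    return solveh_banded(ab, rhs)
```

With these pieces, the optimality system \((T_k + \mu _k I)\bar{x}_k = -\Vert g\Vert e_1\) amounts to a single call of the second routine with right-hand side \(-\Vert g\Vert e_1\).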

Of course, we need to judge when \(x_k\) is a meaningful approximation to \(x_*\) as the subspaces evolve, and furthermore we need to solve each successive subproblem (1.5) efficiently. The former is addressed in [2, 6] and requires estimates of \(\mu _k\), while the latter appeals to the ideas in [5] and relies on a good starting “guess” for \(\mu _k\). Thus generating a good starting guess provides the motivation for this short paper.

In the next section we provide a set of lemmas leading to our main result, namely that so long as the evolving Krylov subspaces are of full dimensionality, the norms of the solution estimates \(\Vert x_k\Vert \) and the corresponding “multipliers” \(\mu _k\) increase monotonically. We summarize, extend and discuss implications and limitations of our results in the concluding Sect. 3.

2 The main result

We start with four vital lemmas that we use to prove our main result. The first shows a simple property of the conjugate gradient method. We use the generic notation \(H \succ 0\) (resp. \(H \succeq 0\)) to mean that the real, symmetric matrix H is positive definite (resp. positive semi-definite).

Lemma 1

Given a real symmetric matrix H and real vector g, let

$$\begin{aligned} {\mathcal{K}_k(H,g) :=\text{ span }\{ g, Hg, H^2 g, \ldots , H^{k-1} g \},} \end{aligned}$$

\(k \ge 1\), be the k-th Krylov subspace generated by H and the vector g, and let the columns of \(V_k\) provide an orthonormal basis for \(\mathcal{K}_k(H,g)\). Letting \(\ell \ge k \ge 1\), suppose that

$$\begin{aligned} {V_{\ell }^T H V_{\ell }^{} \succ 0} \end{aligned}$$
(2.1)

and define

$$\begin{aligned} {x_j :=\arg \min _{x \in \mathcal{K}_j(H,g)} \, {\scriptstyle \frac{1}{2}}x^T H x + g^T x \;\; \text{ for } \;\; j \in \{k, \ell \}.} \end{aligned}$$

Then

$$\begin{aligned} {\Vert x_k \Vert \le \Vert x_{\ell } \Vert .} \end{aligned}$$

Proof

This follows from [4, Thm.7.5.1] as the requirement there, namely that \(p_k^T H p_k^{} > 0\) for specific vectors \(p_k \in \mathcal{K}_k(H,g)\), is implied by the more general assumption (2.1). \(\square \)

Note that this is a generalization of [13, Thm.2.1] that relaxes the requirement that H be positive definite on all of \(\mathfrak {R}^n\) to positive definiteness merely over the evolving Krylov subspaces of interest.
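
As an informal numerical check of Lemma 1 (purely illustrative; the construction below is ours), one can minimize the quadratic \({\scriptstyle \frac{1}{2}}x^T H x + g^T x\) over nested Krylov subspaces of an indefinite H and verify that the norms of the minimizers are nondecreasing while the projected matrices remain positive definite.

```python
import numpy as np

rng = np.random.default_rng(0)
n, kmax = 50, 25
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
H = Q @ np.diag(np.linspace(-1.0, 10.0, n)) @ Q.T      # symmetric and indefinite
g = rng.standard_normal(n)

# Orthonormal bases of the nested Krylov subspaces, by Gram-Schmidt on g, Hg, H^2 g, ...
V = np.zeros((n, kmax))
V[:, 0] = g / np.linalg.norm(g)
for j in range(1, kmax):
    w = H @ V[:, j - 1]
    w -= V[:, :j] @ (V[:, :j].T @ w)
    V[:, j] = w / np.linalg.norm(w)

norms = []
for k in range(1, kmax + 1):
    Tk = V[:, :k].T @ H @ V[:, :k]
    if np.linalg.eigvalsh(Tk)[0] <= 0:                 # stop once assumption (2.1) fails
        break
    xk = V[:, :k] @ np.linalg.solve(Tk, -V[:, :k].T @ g)
    norms.append(np.linalg.norm(xk))

assert all(a <= b + 1e-8 for a, b in zip(norms, norms[1:]))   # nondecreasing, as Lemma 1 asserts
```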

Our second lemma compares Krylov subspaces of the matrices H and \(H +\mu I\) for some \(\mu \in \mathfrak {R}\).

Lemma 2

[11, Lem.2.3]. Let H, g and \(\mathcal{K}_k\) be as in Lemma 1, and \(\mu \in \mathfrak {R}\). Then

$$\begin{aligned} {\mathcal{K}_k (H+\mu I, g ) = \mathcal{K}_k(H,g).} \end{aligned}$$
(2.2)

Next, we state a crucial relation between the parameter \(\nu \) that defines \(Q_{\nu }(x)\) in (1.4) and the norm of the minimizer of \(Q_{\nu }(x)\) within the k-th Krylov space.

Lemma 3

[11, Lem.2.5]. Suppose that the columns of \(V_k\) provide an orthonormal basis for \(\mathcal{K}_k(H,g)\) for given real symmetric H and real g. Let \(V_k^T H V_k^{} + \nu _i^{} I\), \(\nu _i \in \mathfrak {R}\), \(i \in \{1, 2\}\), be symmetric and positive definite. Let

$$\begin{aligned} {x_k(\nu _i) :=\arg \min _{x \in \mathcal{K}_k(H,g)} Q_{\nu _i}(x), \;\; i \in \{1, 2\}.} \end{aligned}$$
Then

$$\begin{aligned} {\nu _2 \le \nu _1 \;\; \text{ if } \text{ and } \text{ only } \text{ if } \;\; \Vert x_k (\nu _2 )\Vert \ge \Vert x_k (\nu _1 )\Vert .} \end{aligned}$$
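
For intuition, and only as an informal aside (the differentiation argument below is ours), write \(x_k(\nu ) = V_k \bar{x}_k(\nu )\) with \(( V_k^T H V_k^{} + \nu I ) \bar{x}_k(\nu ) = - V_k^T g\); differentiating this system with respect to \(\nu \) and using the orthonormality of the columns of \(V_k\) gives

$$\begin{aligned} {\frac{d}{d\nu }\Vert x_k(\nu )\Vert ^2 = \frac{d}{d\nu }\Vert \bar{x}_k(\nu )\Vert ^2 = -2\, \bar{x}_k(\nu )^T ( V_k^T H V_k^{} + \nu I )^{-1} \bar{x}_k(\nu ) \le 0,} \end{aligned}$$

so \(\Vert x_k(\nu )\Vert \) is nonincreasing on the interval of \(\nu \) for which \(V_k^T H V_k^{} + \nu I \succ 0\), consistent with the equivalence stated in the lemma.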

We define the grade of H and g, \(\text{ grade }(H,g) \le n\), to be the maximum dimension of the evolving Krylov spaces \(\mathcal{K}_k(H,g)\), \(k = 1, \ldots , n\) [10]. Our final lemma indicates that the evolving minimizers are unique.

Lemma 4

Let H, g and \(V_k\) be as in Lemma 3 and let \(\mu _k\) be the rightmost root of the secular equation

$$\begin{aligned} { \Vert \bar{x}_k^{}(\mu )\Vert ^{p-2} - \frac{\mu }{\sigma } = 0, \;\; \text{ where } \;\; ( V_k^T H V_k^{} + \mu I ) \bar{x}_k^{}( \mu ) = - V_k^T g.} \end{aligned}$$
(2.3)

Then \(V_k^T H V_k^{} + \mu _k^{} I \succ 0\) for all \(1 \le k \le m :=\text{ grade }(H,g)\).

Proof

Using the Lanczos orthonormal basis, we have that \(V_k^T H V_k^{} = T_k^{}\) for an irreducible tridiagonal matrix \(T_k\) for \(k = 1, \ldots , m\). It then follows [4, Thm.7.5.12] that the projected gradient \(V_k^T g = \Vert g\Vert e_1\) has a nonzero component along the eigenspace of \(T_k\) corresponding to the eigenvalue \(\lambda _{\min }(T_k)\) (i.e., the “hard case” cannot occur for the reduced problem), and thus that the only permitted root \(\mu _k\) of the secular Eq. (2.3) satisfies \(\mu _k > - \lambda _{\min }(T_k)\), where \(\lambda _{\min }\) denotes the leftmost eigenvalue of its symmetric matrix argument [5, Sec.2.2]. \(\square \)
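
The structural fact behind the citation of [4, Thm.7.5.12] is that an eigenvector u of an irreducible tridiagonal matrix cannot have a zero first component. As a brief sketch of the standard argument (in our notation, with \(\alpha _j\) and \(\beta _j \ne 0\) denoting the diagonal and off-diagonal entries of \(T_k\)): if \(T_k u = \lambda u\) with \(u_1 = 0\), then reading the eigen-equation row by row,

$$\begin{aligned} {\alpha _1 u_1 + \beta _1 u_2 = \lambda u_1 \;\Rightarrow \; u_2 = 0 \;\; \text{ and } \;\; \beta _{j-1} u_{j-1} + \alpha _j u_j + \beta _j u_{j+1} = \lambda u_j \;\Rightarrow \; u_{j+1} = 0} \end{aligned}$$

for \(j = 2, \ldots , k-1\), forcing \(u = 0\), a contradiction. Consequently \(V_k^T g = \Vert g\Vert e_1\) has a nonzero component along every eigenvector of \(T_k\), and in particular along those associated with \(\lambda _{\min }(T_k)\).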

We are now in a position to state and prove our main theorem.

Theorem 1

Given a real symmetric matrix H, vector g and scalars \(\sigma > 0\) and \(p > 2\), let \(m = \text{ grade }(H,g)\),

$$\begin{aligned} {x_j :=\arg \min _{x \in \mathcal{K}_j(H,g)} Q(x;\sigma ,p)} \end{aligned}$$
and

$$\begin{aligned} {\mu _j = \sigma \Vert x_j\Vert ^{p-2}} \end{aligned}$$
(2.4)

for \(j \ge 1\). Then \(\mu _k \le \mu _{\ell }\) and \(\Vert x_k\Vert \le \Vert x_{\ell }\Vert \) for \(1 \le k \le \ell \le m\).

Proof

Let \(V_j\) be as in the statement of Lemma 3. The vector \(x_j = V_j y_j\) is a minimizer of the j-th regularization subproblem if and only if

$$\begin{aligned} {V_j^T (H + \mu _j I) V_j y_j = -V_j^T g, \;\; V_j^T (H + \mu _j I) V_j \succeq 0, \;\; \text{ and } \;\; \mu _j = \sigma \Vert y_j\Vert ^{p-2},} \end{aligned}$$
(2.5)

and the minimizer is unique since \(V_j^T (H + \mu _j I) V_j \succ 0\) from Lemma 4 [5, Thm.2]. Consider two integers k and \(\ell \) for which \(1 \le k \le \ell \le m\).

Since we have \(V_k^T (H + \mu _k I)V_k \succ 0\) and \(V_{\ell }^T (H + \mu _{\ell } I)V_{\ell } \succ 0\), and as \(\mathcal{K}_k (H+\mu _kI,g) = \mathcal{K}_k(H,g)\) by Lemma 2, it follows from (2.5) that \(x_k\) is also the (unique) solution of the subspace minimization problem

$$\begin{aligned} {\min _{x \in \mathcal{K}_k(H,g)} Q_{\mu _k}(x).} \end{aligned}$$
Assume that \(\mu _k > \mu _{\ell }\), which implies that \(V_{\ell }^T (H + \mu _k I)V_{\ell } \succ 0\). Let

$$\begin{aligned} {x_{\ell }(\mu _k) :=\arg \min _{x \in \mathcal{K}_{\ell }(H,g)} Q_{\mu _k}(x).} \end{aligned}$$
Then it follows from Lemma 1 that

$$\begin{aligned} {\Vert x_k \Vert \le \Vert x_{\ell }(\mu _k) \Vert .} \end{aligned}$$
(2.6)

But since \(\mu _{\ell } < \mu _k\), Lemma 3 gives that

$$\begin{aligned} {\Vert x_{\ell } (\mu _k )\Vert \le \Vert x_{\ell } (\mu _{\ell } )\Vert = \Vert x_{\ell }\Vert .} \end{aligned}$$
(2.7)

Hence using the definition (2.4) and combining the inequalities (2.6) and (2.7)

$$\begin{aligned} {\mu _k = \sigma \Vert x_k\Vert ^{p-2} \le \sigma \Vert x_{\ell }\Vert ^{p-2} = \mu _{\ell } < \mu _k} \end{aligned}$$

which is a contradiction. Thus \(\mu _k \le \mu _{\ell }\) has to hold. It then follows from the definition (2.4) that \(\Vert x_k\Vert \le \Vert x_{\ell }\Vert \). \(\square \)
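
As an informal numerical illustration of Theorem 1 (not part of the analysis; the construction and names below are ours), the following sketch computes the multipliers \(\mu _k\) for an indefinite H by bracketing the rightmost root of the reduced secular Eq. (2.3) on each Krylov subspace, and checks that both \(\mu _k\) and \(\Vert x_k\Vert \) are nondecreasing.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
n, kmax, sigma, p = 40, 15, 0.1, 3.0
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
H = Q @ np.diag(np.linspace(-2.0, 8.0, n)) @ Q.T       # symmetric and indefinite
g = rng.standard_normal(n)

# Orthonormal bases of the nested Krylov subspaces K_k(H, g)
V = np.zeros((n, kmax))
V[:, 0] = g / np.linalg.norm(g)
for j in range(1, kmax):
    w = H @ V[:, j - 1]
    w -= V[:, :j] @ (V[:, :j].T @ w)
    V[:, j] = w / np.linalg.norm(w)

mus, xnorms = [], []
for k in range(1, kmax + 1):
    Tk = V[:, :k].T @ H @ V[:, :k]
    gk = V[:, :k].T @ g

    def secular(mu):               # secular function of (2.3) on the k-th subspace
        xbar = np.linalg.solve(Tk + mu * np.eye(k), -gk)
        return np.linalg.norm(xbar) ** (p - 2) - mu / sigma

    lo = max(0.0, -np.linalg.eigvalsh(Tk)[0]) + 1e-10
    hi = lo + 1.0
    while secular(hi) > 0:         # expand until the secular function changes sign
        hi *= 2.0
    mu_k = brentq(secular, lo, hi) # rightmost root
    mus.append(mu_k)
    xnorms.append((mu_k / sigma) ** (1.0 / (p - 2)))   # ||x_k|| recovered from mu_k

assert all(a <= b + 1e-8 for a, b in zip(mus, mus[1:]))        # Theorem 1: mu_k nondecreasing
assert all(a <= b + 1e-8 for a, b in zip(xnorms, xnorms[1:]))  # and ||x_k|| nondecreasing
```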

The monotonic behaviour of the multipliers \(\mu _k\) was predicted in [9, Lem.2.6] when \(p=3\), but the proof suggested there relied on [11, Thm.2.6], which appears to have a minor flaw—the proof depends on [13, Thm.2.1], but applies this at one point to an indefinite \(H+\mu I\). Lemma 1 avoids this issue, and the same result fixes the proof of [11, Thm.2.6] that applies in the trust-region case.

3 Comments and conclusions

We have shown that the norms of the approximations generated by well-known Krylov methods for solving the regularization problem (1.1) increase monotonically as the dimension of the Krylov spaces expands. This implies that the corresponding “multipliers” \(\mu _k\) also increase, and is useful as estimates of these multipliers are crucial when solving the Krylov subproblem; in particular, as the multiplier for the k-th problem is a lower bound for the \((k+1)\)-st, Newton-like iterations for the required root of the secular equation

$$\begin{aligned} {\Vert \bar{x}_{k+1}( \mu )\Vert ^{p-2} - \frac{\mu }{\sigma } = 0, \;\; \text{ where } \;\; ( V_{k+1}^T H V_{k+1}^{} + \mu I ) \bar{x}_{k+1}^{}( \mu ) = - V_{k+1}^T g, } \end{aligned}$$

will converge both globally and rapidly to \(\mu _{k+1}\) when started from \(\mu _{k}\) if additionally \(\mu _{k} > -\lambda _{\min }(T_{k+1})\) [5, §3]. In particular, Newton’s method, the secant method or methods based upon certain higher (odd)-order Taylor approximations or nonlinear rescalings of the term \(\Vert \bar{x}( \mu )\Vert ^{p-2}\) all converge monotonically from such a starting \(\mu \). Knowledge of the monotonic nature of these quantities is also important when deriving convergence bounds [6] for such methods.
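
For concreteness, one possible form of the Newton correction alluded to above, written in our notation for the secular function \(\phi _{k+1}(\mu ) :=\Vert \bar{x}_{k+1}(\mu )\Vert ^{p-2} - \mu /\sigma \), is

$$\begin{aligned} {\mu ^{+} = \mu - \frac{\phi _{k+1}(\mu )}{\phi _{k+1}'(\mu )}, \;\; \text{ where } \;\; \phi _{k+1}'(\mu ) = -(p-2)\, \Vert \bar{x}_{k+1}(\mu )\Vert ^{p-4}\, \bar{x}_{k+1}(\mu )^T ( V_{k+1}^T H V_{k+1}^{} + \mu I )^{-1} \bar{x}_{k+1}(\mu ) - \frac{1}{\sigma },} \end{aligned}$$

so each iteration needs only one additional solve with the shifted tridiagonal \(T_{k+1} + \mu I\), whose factorization is already available from computing \(\bar{x}_{k+1}(\mu )\).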

We warn readers that in exceptional circumstances, namely when g is orthogonal to the eigenspace corresponding to the leftmost eigenvalue of H and \(\sigma \) is not large enough, the global minimizer of (1.1) will not lie in \(\mathcal{K}_m(H,g)\), and \(\mu _m\) will underestimate the optimal multiplier. This (zero-probability) possibility is often referred to as the “hard case” ([3, §6.1], [12]) and, despite the popularity of Krylov methods, might be viewed as an unavoidable defect of this class of methods.

The main result here may trivially be extended for Krylov methods to

$$\begin{aligned} {\min _{x \in \mathfrak {R}^n} \; {\scriptstyle \frac{1}{2}}x^T H x + g^T x + \frac{\sigma }{p} \Vert x\Vert _M^p} \end{aligned}$$
for given symmetric \(M \succ 0\), where \(\Vert x\Vert _M^2 :=x^T M x\), so long as we instead consider the Krylov spaces \(\mathcal{K}(M^{-1} H, M^{-1} g)\). It is well known that this may be achieved using the M-preconditioned Lanczos method [3, Sec.6.3]. In particular, if

$$\begin{aligned} {x_j :=\arg \min _{x \in \mathcal{K}_j(M^{-1}H, M^{-1}g)} {\scriptstyle \frac{1}{2}}x^T H x + g^T x + \frac{\sigma }{p} \Vert x\Vert _M^p \;\; \text{ and } \;\; \mu _j = \sigma \Vert x_j\Vert _M^{p-2},} \end{aligned}$$
it follows (using the transformation \(x \leftarrow M^{{\scriptstyle \frac{1}{2}}} x\)) that

$$\begin{aligned} {\mu _k \le \mu _{\ell } \;\; \text{ and } \;\; \Vert x_k\Vert _M \le \Vert x_{\ell }\Vert _M} \end{aligned}$$

for \(1 \le k \le \ell \le m\) just as in Theorem 1.
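
To see why the transformation works, note, as a brief sketch in our notation, that substituting \(y = M^{{\scriptstyle \frac{1}{2}}} x\) turns the M-norm problem above into a Euclidean-norm problem of the form (1.1),

$$\begin{aligned} {\min _{y \in \mathfrak {R}^n} \; {\scriptstyle \frac{1}{2}}y^T ( M^{-{\scriptstyle \frac{1}{2}}} H M^{-{\scriptstyle \frac{1}{2}}} ) y + ( M^{-{\scriptstyle \frac{1}{2}}} g )^T y + \frac{\sigma }{p} \Vert y\Vert ^p,} \end{aligned}$$

and that \(\mathcal{K}_k(M^{-{\scriptstyle \frac{1}{2}}} H M^{-{\scriptstyle \frac{1}{2}}}, M^{-{\scriptstyle \frac{1}{2}}} g) = M^{{\scriptstyle \frac{1}{2}}} \mathcal{K}_k(M^{-1} H, M^{-1} g)\), so that Theorem 1 applied to the transformed problem delivers the stated M-norm monotonicity.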