1 Introduction

(Golub–Kahan–) Lanczos bidiagonalization [5] (see also, e.g., [6]) is a popular method to approximate singular values of large sparse matrices. Let \(A\) be a real \(m \times n\) matrix with singular value decomposition (SVD) \(A=X\Sigma Y^T\) with singular values

$$\begin{aligned} 0 \le \sigma _{\min } = \sigma _p \le \sigma _{p-1} \le \cdots \le \sigma _2 \le \sigma _1 = \sigma _{\max } = \Vert A\Vert , \end{aligned}$$

where \(p {:}{=} \min \{m,n\}\), and \(\Vert \cdot \Vert \) denotes the 2-norm. Denote the corresponding right singular vectors by \(\mathbf{y}_1, \ldots , \mathbf{y}_n\). Lanczos bidiagonalization usually approximates the largest singular values well and, to a lesser extent, also the smallest ones. However, the results of the method depend on the choice of the initial vector \(\mathbf{v}_1\). The computed approximation to the largest singular value \(\sigma _{\max }\) is always a lower bound, but if a poor choice is made for \(\mathbf{v}_1\), that is, if \(\mathbf{v}_1\) is almost deficient in the direction \(\mathbf{y}_1\), the true value of \(\Vert A\Vert \) may be arbitrarily much larger than this approximation. Often no a priori information on \(\mathbf{y}_1\) is available. For this reason a random choice for \(\mathbf{v}_1\) is considered relatively safe; \(\mathbf{v}_1\) is usually selected randomly in (industrial) codes.

Using the fact that \(\mathbf{v}_1\) is chosen randomly, we will develop probabilistic bounds for \(\Vert A\Vert \); i.e., bounds that hold with a user-selected probability \(1-\varepsilon \), for \(\varepsilon \ll 1\). The bounds may be viewed as a side-product or post-processing step of Lanczos bidiagonalization and may be computed efficiently: for large \(A\), the computational costs are very modest compared to the Lanczos bidiagonalization process itself.

The fact that a random vector is unlikely to be nearly deficient in \(\mathbf{y}_1\) enables us to develop probabilistic inclusion intervals for the matrix norm. Here we exploit the fact that the Lanczos polynomials tend to increase rapidly to the right of their largest zeros (see Sect. 2). Therefore, with our new low-cost process as an addition to the Lanczos bidiagonalization method, we usually obtain not only good lower bounds for \(\Vert A\Vert \), but also sharp upper bounds that hold with high probability.

Efficient state-of-the-art methods based on Lanczos bidiagonalization use some restart mechanism; see, e.g., [1, 11]. We will not consider restarts in this paper for two main reasons: first, the unrestarted case makes the theoretical analysis of Sects. 2, 3 and 4 possible; and second, it will turn out that a modest number of Lanczos bidiagonalization steps usually already suffices for quality probabilistic inclusion intervals. We will also assume exact arithmetic; in the experiments in Sect. 5 we use a stable variant with reorthogonalization.

This paper is inspired by [13] and has been organized as follows. Section 2 studies polynomials that are implicitly formed in the Lanczos bidiagonalization process. These are used in Sects. 3 and 4 to develop probabilistic upper bounds for the matrix 2-norm. Numerical experiments are presented in Sect. 5, and a discussion and some conclusions can be found in Sect. 6.

2 Polynomials Arising in Lanczos Bidiagonalization

Given a vector \(\mathbf{v}_1\) with unit norm, the defining relations of Lanczos bidiagonalization are \(\beta _0 = 0, \mathbf{u}_0 = \mathbf{0}\), and for \(k \ge 1\):

$$\begin{aligned} \begin{array}{rcl} \alpha _k \mathbf{u}_k &=& A \mathbf{v}_k - \beta _{k-1} \mathbf{u}_{k-1} \\ \beta _k \mathbf{v}_{k+1} &=& A^T \! \mathbf{u}_k - \alpha _k \mathbf{v}_k \end{array} \end{aligned}$$
(1)

where

$$\begin{aligned} \alpha _j = \mathbf{u}_j^T \! A\mathbf{v}_j, \qquad \beta _j = \mathbf{u}_j^T \! A\mathbf{v}_{j+1} \end{aligned}$$
(2)

are nonnegative. After \(k\) steps of the method, these relations can be written in matrix form as

$$\begin{aligned} \begin{array}{rcl} A V_k &=& U_k B_k, \\ A^T U_k &=& V_{k+1} \widehat{B}_k^T = V_k B_k^T + \beta _k \mathbf{v}_{k+1} \mathbf{e}_k^T, \end{array} \end{aligned}$$

where \(\mathbf{e}_k\) is the \(k\)th unit vector, and \(U_k = [\mathbf{u}_1 \cdots \mathbf{u}_k]\) and \(V_k = [\mathbf{v}_1 \cdots \mathbf{v}_k]\) have orthonormal columns spanning the subspaces \(\mathcal U _k\) and \(\mathcal V _k\), respectively. The \(k \times k\) matrix

$$\begin{aligned} B_k = \left[ \begin{array}{cccc} \alpha _1 & \beta _1 & & \\ & \ddots & \ddots & \\ & & \alpha _{k-1} & \beta _{k-1} \\ & & & \alpha _k \end{array} \right] \end{aligned}$$

and the \(k \times (k+1)\) matrix \(\widehat{B}_k = [B_k \ \, \mathbf{0}] + \beta _k \mathbf{e}_k \mathbf{e}_{k+1}^T\) are both upper bidiagonal matrices. We will not consider the rather exceptional situation of a breakdown of the method (a zero \(\alpha _j\) or \(\beta _j\)) in this paper.
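
For concreteness, the coupled recurrences (1)–(2) translate directly into code. The following Matlab-style sketch (the function name and interface are ours, purely for illustration) carries out \(k\) steps without reorthogonalization and without breakdown checks; a stable variant with reorthogonalization is used in Sect. 5.

    function [U, V, alpha, beta] = lanczos_bidiag(A, v1, k)
    % Minimal sketch of k steps of Lanczos bidiagonalization, cf. (1)-(2);
    % no reorthogonalization, no breakdown checks.
    [m, n] = size(A);
    U = zeros(m, k);  V = zeros(n, k+1);
    alpha = zeros(k, 1);  beta = zeros(k, 1);
    V(:,1) = v1 / norm(v1);
    u = zeros(m, 1);  b = 0;                    % u_0 = 0, beta_0 = 0
    for j = 1:k
        u = A*V(:,j) - b*u;                     % alpha_j u_j = A v_j - beta_{j-1} u_{j-1}
        alpha(j) = norm(u);  u = u/alpha(j);  U(:,j) = u;
        w = A'*u - alpha(j)*V(:,j);             % beta_j v_{j+1} = A^T u_j - alpha_j v_j
        beta(j) = norm(w);   V(:,j+1) = w/beta(j);
        b = beta(j);
    end
    end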

Introduce the bilinear forms

$$\begin{aligned} \langle f, g \rangle {:}{=} \mathbf{v}_1^T f(A^T \! A) \, g(A^T \! A) \, \mathbf{v}_1 \end{aligned}$$

and

$$\begin{aligned}{}[f, g] {:}{=} \mathbf{v}_1^T A^T f(AA^T) \, g(AA^T) \, A\mathbf{v}_1 = \mathbf{v}_1^T f(A^T \! A) \, A^T\!A \ g(A^T \! A) \, \mathbf{v}_1 \end{aligned}$$

for functions \(f\) and \(g\) that are analytic in a neighborhood of the squares of the singular values of \(A\). The following result is the starting point for this paper.

Proposition 1

The vectors \(\mathbf{u}_k\) and \(\mathbf{v}_k\) can be written as polynomials of degree \(k-1\) in \(AA^T\) and \(A^T \! A\), respectively, applied to \(A\mathbf{v}_1\) and \(\mathbf{v}_1\):

$$\begin{aligned} \mathbf{u}_k = p_{k-1}(AA^T) \, A\mathbf{v}_1, \qquad \mathbf{v}_k = q_{k-1}(A^T \! A) \, \mathbf{v}_1. \end{aligned}$$

The following recurrence relations hold: \(p_{-1}(t) = 0, q_0(t) = 1\), and for \(k \ge 0\):

$$\begin{aligned} \begin{array}{rcl} \alpha _{k+1} \, p_k(t) &=& q_k(t) - \beta _k \, p_{k-1}(t), \\ \beta _{k+1} \, q_{k+1}(t) &=& t \, p_k(t) - \alpha _{k+1} \, q_k(t). \end{array} \end{aligned}$$

Moreover,

$$\begin{aligned} \begin{array}{rclcl} \alpha _k &=& \langle p_{k-1}, \, t \, q_{k-1} \rangle &=& [p_{k-1}, \, q_{k-1}], \\ \beta _k &=& \langle p_{k-1}, \, t \, q_k \rangle &=& [p_{k-1}, \, q_k]. \end{array} \end{aligned}$$

Proof

This follows by induction; the recurrence relations follow from substitution into (1). The inner products can be derived from (2). \(\square \)
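
Since only the scalars \(\alpha _j\) and \(\beta _j\) enter these recurrences, the polynomials can be evaluated at an arbitrary point \(t\) at negligible cost once the bidiagonalization coefficients are available; this is what will be needed in Sect. 3. A Matlab-style sketch (the helper name is ours; note that \(p_k\) additionally requires \(\alpha _{k+1}\), i.e., the first half of step \(k+1\)):

    function [qk, pk] = eval_lanczos_poly(alpha, beta, t)
    % Evaluate the Lanczos polynomials q_k(t) and p_k(t) via the recurrences
    % of Proposition 1, with p_{-1} = 0, q_0 = 1.  Input: alpha(1:k+1),
    % beta(1:k), and a scalar t.
    k = length(beta);
    p_old = 0;  q = 1;  b_old = 0;              % p_{-1}(t), q_0(t), beta_0
    for j = 1:k
        p = (q - b_old*p_old)/alpha(j);         % p_{j-1}(t)
        q = (t*p - alpha(j)*q)/beta(j);         % q_j(t)
        p_old = p;  b_old = beta(j);
    end
    qk = q;
    pk = (q - b_old*p_old)/alpha(k+1);          % p_k(t)
    end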

We now study several useful properties of these Lanczos bidiagonalization polynomials \(p_k\) and \(q_k\) that will be used in the rest of the paper. First, we point out close relations between Lanczos bidiagonalization and two other Lanczos processes. Note that

$$\begin{aligned} A^T \! A V_k&= A^T U_k B_k \nonumber \\&= V_k B_k^T B_k + \beta _k \mathbf{v}_{k+1} \mathbf{e}_k^T B_k \\&= V_k B_k^T B_k + \alpha _k \beta _k \mathbf{v}_{k+1} \mathbf{e}_k^T \nonumber \end{aligned}$$
(3)

and

$$\begin{aligned} AA^T U_k&= A V_k B_k^T + \beta _k A \mathbf{v}_{k+1} \mathbf{e}_k^T \nonumber \\&= U_k B_k B_k^T + \beta _k U_{k+1} B_{k+1} \mathbf{e}_{k+1} \mathbf{e}_k^T \\&= U_k \widehat{B}_k \widehat{B}_k^T + \alpha _{k+1} \beta _k \mathbf{u}_{k+1} \mathbf{e}_k^T. \nonumber \end{aligned}$$
(4)

We see from these equations that Lanczos bidiagonalization simultaneously performs a Lanczos process on \(A^T \! A\) with starting vector \(\mathbf{v}_1\), and a Lanczos process on \(AA^T\) with starting vector \(\mathbf{u}_1 {:}{=} \alpha _1^{-1} A\mathbf{v}_1\) (the normalized \(A\mathbf{v}_1\)). The symmetric tridiagonal matrices \(B_k^T B_k\) and \(\widehat{B}_k \widehat{B}_k^T\), respectively, that arise in the Lanczos methods are decomposed as the product of the bidiagonal matrices that arise in Lanczos bidiagonalization. We use (3) and (4) to characterize the zeros of the polynomials \(p_k\) and \(q_k\); see Proposition 2.

Denote the singular values of \(B_k\) by

$$\begin{aligned} \theta _k^{(k)} \le \dots \le \theta _1^{(k)} \end{aligned}$$

and the corresponding right singular vectors by \(\mathbf{d}_k^{(k)}\), ..., \(\mathbf{d}_1^{(k)}\). We write \(\widehat{\theta }_k^{(k)} \le \dots \le \widehat{\theta }_1^{(k)}\) for the singular values of \(\widehat{B}_k\) and \(\widehat{\mathbf{c}}_k^{(k)}, \ldots , \widehat{\mathbf{c}}_1^{(k)}\) for its left singular vectors. To avoid a heavy notation we will often omit the superscript \((k)\) in the sequel. A key aspect of Lanczos bidiagonalization is that often the singular values of both \(B_k\) and \(\widehat{B}_k\) are good approximations to the singular values of \(A\); in particular to the largest and (to a lesser extent) to the smallest singular values.

In the next proposition, \(I_k\) stands for the identity of dimension \(k\).

Proposition 2

  (a)

    The zeros of \(q_k\) are exactly \(\theta _1^2, \ldots , \theta _k^2\). This implies that \(q_k(t)\) is a nonzero multiple of \(\det (tI_k-B_k^TB_k)\).

  (b)

    The zeros of \(p_k\) are exactly \(\widehat{\theta }_1^2, \ldots , \widehat{\theta }_k^2\). This implies that \(p_k(t)\) is a nonzero multiple of \(\det (tI_k-\widehat{B}_k \widehat{B}_k^T)\).

Proof

From (3) it may be checked that the pairs \((\theta _j^2, V_k\mathbf{d}_j), j=1,\ldots ,k\), satisfy the Galerkin condition

$$\begin{aligned} A^T \! AV_k \mathbf{d}_j - \theta _j^2 \, V_k \, \mathbf{d}_j \perp \mathcal V _k. \end{aligned}$$

Since \(V_k\mathbf{d}_j \in \mathcal V _k\), we can write

$$\begin{aligned} V_k\mathbf{d}_j = s_j(A^T \! A) \, \mathbf{v}_1 \end{aligned}$$
(5)

for a polynomial \(s_j = s_j^{(k)}\) of degree at most \(k-1\). For all \(j=1,\ldots ,k\), we have that \((A^T \! A - \theta _j^2 I) V_k\mathbf{d}_j\) is in \(\mathcal V _{k+1}\) but is orthogonal to \(\mathcal V _k\). Therefore, these vectors have to be nonzero multiples of the vector \(\mathbf{v}_{k+1} = q_k(A^T \! A) \, \mathbf{v}_1\). Hence, \(q_k(t)\) should contain all factors \((t-\theta _j^2)\), and therefore is a nonzero multiple of

$$\begin{aligned} \mu (t) = (t-\theta _1^2) \cdots (t-\theta _k^2). \end{aligned}$$

Part (b) follows in a similar manner starting with the Galerkin condition

$$\begin{aligned} AA^T\!U_k \widehat{\mathbf{c}}_j - \widehat{\theta }_j^2 \, U_k \, \widehat{\mathbf{c}}_j \perp \mathcal U _k \end{aligned}$$

for the pairs \((\widehat{\theta }_j^2, U_k\widehat{\mathbf{c}}_j)\). Since \(U_k\widehat{\mathbf{c}}_j \in \mathcal U _k\), we can write

$$\begin{aligned} U_k\widehat{\mathbf{c}}_j = r_j(AA^T) A\mathbf{v}_1 \end{aligned}$$
(6)

for a polynomial \(r_j = r_j^{(k)}\) of degree at most \(k-1\). For all \(j=1,\ldots ,k\), we have that \((AA^T\!-\widehat{\theta }_j^2 I) U_k \widehat{\mathbf{c}}_j\) is in \(\mathcal U _{k+1}\) but is orthogonal to \(\mathcal U _k\). Therefore, these vectors have to be nonzero multiples of the vector \(\mathbf{u}_{k+1} = p_k(AA^T) A\mathbf{v}_1\). Hence, \(p_k(t)\) should contain all factors \((t-\widehat{\theta }_j^2)\), and therefore is a nonzero multiple of

$$\begin{aligned} \widehat{\mu }(t) = (t-\widehat{\theta }_1^2) \cdots (t-\widehat{\theta }_k^2); \end{aligned}$$

cf. also the discussion in [10, p. 266–267]. \(\square \)

Corollary 3

$$\begin{aligned} \begin{array}{rcl} V_k \mathbf{d}_j &=& \nu _j(A^T \! A) \mathbf{v}_1 \, / \, \Vert \nu _j(A^T \! A) \mathbf{v}_1\Vert \qquad (j=1,\ldots ,k), \\ \Vert \mu (A^T \! A) \mathbf{v}_1\Vert &=& \min \Vert \omega (A^T \! A) \mathbf{v}_1\Vert , \\ U_k \widehat{\mathbf{c}}_j &=& \widehat{\nu }_j(AA^T) A\mathbf{v}_1 \, / \, \Vert \widehat{\nu }_j(AA^T) A\mathbf{v}_1\Vert \qquad (j=1,\ldots ,k), \\ \Vert \widehat{\mu }(AA^T) A\mathbf{v}_1\Vert &=& \min \Vert \omega (AA^T) A\mathbf{v}_1\Vert , \end{array} \end{aligned}$$

where \(\nu _j(t) = \mu (t) / (t-\theta _j^2), \widehat{\nu }_j(t) = \widehat{\mu }(t) / (t-\widehat{\theta }_j^2)\), and the minimum is taken over all monic polynomials \(\omega \) of degree \(k\).

Proof

This follows from the proof of the previous proposition; cf. also [10, p. 266]. \(\square \)

The following results will be used for an efficient numerical procedure in the next section.

Proposition 4

The polynomials \(p_k\) and \(q_k\) have positive leading coefficients and increase strictly monotonically to the right of their largest zeros \(\widehat{\theta }_1^2\) and \(\theta _1^2\), respectively.

Proof

This follows from Proposition 2 and the fact that \(p_k\) and \(q_k\) are polynomials of degree \(k\); the positivity of the leading coefficients follows by induction from the recurrence relations in Proposition 1, since all \(\alpha _j\) and \(\beta _j\) are positive. \(\square \)

Proposition 5

For \(1 \le j \le k\) the convergence to the largest singular values is monotonic:

$$\begin{aligned} \theta _j^{(k)} \le \widehat{\theta }_j^{(k)} \le \theta _j^{(k+1)} \le \sigma _j. \end{aligned}$$

Proof

This follows from the fact that \(\widehat{B}_k\) is the matrix \(B_k\) expanded with an extra \((k+1)\)st column. Likewise, \(B_{k+1}\) is \(\widehat{B}_k\) expanded with an extra \((k+1)\)st row. Now apply [8, (3.3.17)], see also [7, Theorem 4.3]. \(\square \)

Taking \(j=1\) in Proposition 5 shows that the largest singular values of \(B_k\) and \(\widehat{B}_k\) are guaranteed lower bounds for \(\Vert A\Vert \) of increasing quality. Furthermore, the polynomials \(p_k\) and \(q_k\) will be used to derive probabilistic bounds for the matrix norm in the next section.

3 Probabilistic Bounds for the Matrix Norm

We will now develop probabilistic bounds for \(\Vert A\Vert \ (= \sigma _1 = \sigma _{\max })\), making use of the fact that the polynomials \(p_k\) and \(q_k\) tend to increase rapidly to the right of their largest zeros \(\widehat{\theta }_1^2\) and \(\theta _1^2\), respectively. Let

$$\begin{aligned} \mathbf{v}_1 = \sum _{j=1}^n \gamma _j \, \mathbf{y}_j \end{aligned}$$

be the decomposition of the starting vector \(\mathbf{v}_1\) with respect to the right singular vectors.

Lemma 6

We have \(p_k(\sigma _1^2) > 0\) and \(q_k(\sigma _1^2) > 0\).

Proof

This follows from the combination of Propositions 4 and 5. \(\square \)

We now arrive at the main argument. From

$$\begin{aligned} 1 = \Vert \mathbf{v}_{k+1}\Vert ^2 = \Vert q_k(A^T \! A) \mathbf{v}_1\Vert ^2 = \sum _{j=1}^n \gamma _j^2 \, {q_k (\sigma _j^2)}^2 \end{aligned}$$

and \(q_k(\sigma _1^2) > 0\) (see Lemma 6) it follows that

$$\begin{aligned} 1 \, \ge \, |\gamma _1| \, q_k (\sigma _1^2). \end{aligned}$$

If \(\gamma _1\) were known, this estimate would provide an upper bound \(\sigma _\mathrm{up}\) for \(\Vert A\Vert = \sigma _{\max }\): let \(\sigma _\mathrm{up}\) be the largest zero of

$$\begin{aligned} f_1(t) = q_k(t^2) - 1 / |\gamma _1|. \end{aligned}$$
(7)

One may check that this number \(\sigma _\mathrm{up}\) exists and is larger than \(\theta _1 = \theta _1^{(k)}\); it may, for instance, be determined efficiently by bisection on the interval \([\theta _1^{(k)}, \, \Vert A\Vert _F]\), which is guaranteed to contain \(\sigma _{\max }\). (Note that \(\sigma _\mathrm{up}\) might occasionally even be larger than \(\Vert A\Vert _F\) for small \(k\); in this case we proceed with a larger \(k\), as the information is not useful.)

Since we generally do not know (an estimate of) \(\gamma _1\) in practice, we are interested in the probability that \(|\gamma _1|\) is smaller than a given (small) constant. A small \(|\gamma _1|\) corresponds to an unlucky choice of the initial vector: in this case \(\mathbf{v}_1\) is almost orthogonal to \(\mathbf{y}_1\). The following lemma states a suitable result and enables us to establish probabilistic bounds, i.e., bounds that hold with a certain (user-defined, high) probability. The proof uses the fact that if \(\mathbf{v}_1\) has been chosen randomly with respect to the uniform distribution over the unit sphere \(S^{n-1}\) in \(\mathbb R ^n\), then, as a result, \((\gamma _1, \ldots , \gamma _n)\) is also random in \(S^{n-1}\). It is easy to construct such a random vector (Matlab code: v1=randn(n,1); v1=v1/norm(v1)); see, e.g., [9, p. 1116].

Lemma 7

Assume that the starting vector \(\mathbf{v}_1\) has been chosen randomly with respect to the uniform distribution over the unit sphere \(S^{n-1}\) and let \(\delta \in [0,1]\). Then

$$\begin{aligned} P(|\gamma _1| \le \delta ) = 2 \, G(\textstyle \frac{n-1}{2}, \frac{1}{2}) ^{-1} \cdot {\displaystyle \int \limits _0^{\arcsin (\delta )}} \cos ^{n-2}(t) \, \mathrm{dt}, \end{aligned}$$

where \(G\) denotes Euler’s Beta function: \(G(x,y) = \int _0^1 t^{x-1} (1-t)^{y-1} \mathrm{dt}\).

Proof

See [13, Lemma 3.1]. \(\square \)
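
As a quick illustration of Lemma 7 (a sanity check, not part of the analysis in [13]), the probability can be compared with a Monte Carlo estimate:

    % Monte Carlo check of Lemma 7: fraction of random unit vectors with
    % |gamma_1| <= delta versus the integral expression.
    n = 1000;  delta = 0.05;  N = 1e4;
    G = randn(n, N);                                 % columns: random directions
    gamma1 = G(1,:) ./ sqrt(sum(G.^2, 1));           % first coordinates on S^{n-1}
    emp = mean(abs(gamma1) <= delta);
    thy = 2/beta((n-1)/2, 1/2) * integral(@(t) cos(t).^(n-2), 0, asin(delta));
    fprintf('empirical %.3f, formula %.3f\n', emp, thy);

By rotational invariance, the coefficient \(\gamma _1\) of \(\mathbf{v}_1\) with respect to \(\mathbf{y}_1\) has the same distribution as the first coordinate of a random unit vector, which is what the snippet samples.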

If we would like to have an upper bound for \(\Vert A\Vert \) that is correct with probability at least \(1-\varepsilon \), then we first determine the value of \(\delta \) for which

$$\begin{aligned} \int \limits _0^{\arcsin (\delta )} \cos ^{n-2}(t) \, \mathrm{dt} = \textstyle \frac{\varepsilon }{2} \, G(\textstyle \frac{n-1}{2}, \frac{1}{2}) \quad \left( = \varepsilon \, {\displaystyle \int \limits _0^{\pi /2}} \cos ^{n-2}(t) \, \mathrm{dt} \right) \end{aligned}$$
(8)

holds, e.g., by bisection on the interval \([0, \frac{\pi }{2}]\). The integrals in (8) may be computed using an appropriate quadrature formula.
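
A possible numerical realization of (8) in Matlab is sketched below (the wrapper name is ours); it bisects over the upper limit of integration \(\arcsin (\delta ) \in [0, \frac{\pi }{2}]\), as suggested above, and uses the built-in functions beta and integral:

    function delta = compute_delta(n, epsilon)
    % Solve (8) for delta: bisection over a = arcsin(delta) in [0, pi/2], where
    %   int_0^a cos(t)^(n-2) dt = (epsilon/2) * B((n-1)/2, 1/2).
    rhs = (epsilon/2) * beta((n-1)/2, 1/2);
    F = @(a) integral(@(t) cos(t).^(n-2), 0, a) - rhs;
    lo = 0;  hi = pi/2;                       % F(lo) < 0 <= F(hi) for epsilon < 1
    for iter = 1:60
        mid = (lo + hi)/2;
        if F(mid) < 0, lo = mid; else, hi = mid; end
    end
    delta = sin((lo + hi)/2);
    end

For instance, for \(n = 100\) and \(\varepsilon = 0.01\) this gives \(1/\delta \approx 792\), the value used in Fig. 1 below.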

Moreover, for small \(\varepsilon \), which is our main interest, the behavior of \(\delta \) as a function of \(\varepsilon \) is roughly \(\delta = \delta (\varepsilon ) \approx \varepsilon \cdot \frac{1}{2} \, G(\textstyle \frac{n-1}{2}, \frac{1}{2})\), as is proven in the next result. As an example, we mention that for \(n=1000\) and \(\varepsilon = 0.01\), the true value of \(\delta \) and the estimate from Proposition 8 differ by a relative amount of only \(\approx 2.6 \cdot 10^{-5}\).

Proposition 8

Given \(0 < \varepsilon \ll 1\), let \(\delta = \delta (\varepsilon )\) satisfy (8). Then

$$\begin{aligned} \delta ^{\prime }(0) = \lim _{\varepsilon \rightarrow 0} \frac{\delta (\varepsilon )}{\varepsilon } = \textstyle \frac{1}{2} \, G\left( \frac{n-1}{2}, \frac{1}{2}\right) . \end{aligned}$$

Proof

First note that \(\arcsin (\delta ) = \delta + \mathcal O (\delta ^3)\) for \(\delta \rightarrow 0\). Let \(F(x) = \int _0^{x} \cos ^{n-2}(t) \, \mathrm{d}t\); by (8) and this expansion, \(F(\delta (\varepsilon )) = \frac{\varepsilon }{2} \, G(\frac{n-1}{2}, \frac{1}{2}) + \mathcal O (\delta ^3)\). Then

$$\begin{aligned} \lim _{\varepsilon \rightarrow 0} \frac{F(\delta (\varepsilon ))-F(0)}{\varepsilon } = \cos ^{n-2}(0) \cdot \delta ^{\prime }(0) = \textstyle \frac{1}{2} G(\frac{n-1}{2}, \frac{1}{2}), \end{aligned}$$

where the first equality uses the chain rule and the second uses (8) together with the expansion of \(\arcsin \); since \(\cos ^{n-2}(0) = 1\), this proves the statement. \(\square \)
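
A two-line check of this linearization against (8), using the compute_delta sketch above (the \(\approx 2.6 \cdot 10^{-5}\) relative difference for \(n=1000\), \(\varepsilon =0.01\) was mentioned before Proposition 8):

    n = 1000;  epsilon = 0.01;
    d_exact  = compute_delta(n, epsilon);            % delta from (8), sketch above
    d_approx = (epsilon/2) * beta((n-1)/2, 1/2);     % linearization of Proposition 8
    rel_diff = abs(d_exact - d_approx)/d_exact       % reported: approx 2.6e-5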

When we replace \(|\gamma _1|\) in (7) by the value \(\delta \) computed from (8) and determine the zero \(\sigma _\mathrm{up} > \theta _1^{(k)}\), this \(\sigma _\mathrm{up}\) is an upper bound for the largest singular value \(\sigma _{\max }\) of \(A\) with probability at least \(1-\varepsilon \), which we call a probabilistic upper bound. This zero may be computed efficiently, since the evaluation of \(p_k\) and \(q_k\) may be carried out via the recurrence relations of Proposition 1. (Note that a loop is often preferable to a recursion for a fast implementation.)

A similar line of reasoning can also be followed for the \(p_k\) polynomials: from

$$\begin{aligned} 1 = \Vert \mathbf{u}_{k+1}\Vert ^2 = \Vert p_k(AA^T) A\mathbf{v}_1\Vert ^2 = \sum _{j=1}^n \gamma _j^2 \, \sigma _j^2 p_k(\sigma _j^2)^2 \end{aligned}$$

it follows that (using Lemma 6)

$$\begin{aligned} 1 \ge |\gamma _1| \, \sigma _1 \, p_k(\sigma _1^2). \end{aligned}$$

Again, if \(\gamma _1\) were known, the largest zero of

$$\begin{aligned} f_2(t) = t \, p_k(t^2) - 1 / |\gamma _1| \end{aligned}$$

would yield an upper bound \(\sigma _\mathrm{up}\) for \(\sigma _{\max }\); in practice we again replace the unknown \(|\gamma _1|\) by \(\delta \). Hence we have proved the following theorem.

Theorem 9

Assume that we have carried out \(k\) steps of Lanczos bidiagonalization with starting vector \(\mathbf{v}_1\), which has been chosen randomly with respect to the uniform distribution over \(S^{n-1}\), and let \(\varepsilon \in (0,1)\). Then the largest zeros of the polynomials

$$\begin{aligned} f_1(t)&= \, q_k (t^2) - 1 / \delta \end{aligned}$$
(9)
$$\begin{aligned} f_2(t)&= t \, p_k(t^2) - 1 / \delta \end{aligned}$$
(10)

with \(\delta \) given by (8), are upper bounds for \(\Vert A\Vert \) with probability at least \(1-\varepsilon \).
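
In practice these largest zeros are found numerically; a Matlab-style sketch for \(f_2\) follows (the helper names prob_upper_bound and eval_lanczos_poly are ours, cf. the evaluation sketch in Sect. 2; the analogous routine for \(f_1\) uses \(q_k\) and the left endpoint \(\theta _1^{(k)}\)):

    function sigma_up = prob_upper_bound(alpha, beta, delta, theta1hat, normF)
    % Largest zero of f_2(t) = t*p_k(t^2) - 1/delta, cf. (10), by bisection on
    % [theta1hat, ||A||_F].  Here theta1hat is the largest singular value of
    % Bhat_k, so f_2(theta1hat) = -1/delta < 0, and f_2 is strictly increasing
    % to the right of theta1hat (Propositions 4 and 5).
    f2 = @(t) t * pk_at(alpha, beta, t) - 1/delta;
    lo = theta1hat;  hi = normF;
    if f2(hi) < 0
        sigma_up = Inf;      % bound would exceed ||A||_F: take more steps instead
        return
    end
    for iter = 1:60
        mid = (lo + hi)/2;
        if f2(mid) < 0, lo = mid; else, hi = mid; end
    end
    sigma_up = hi;           % the right endpoint of the bracket is still an upper bound
    end

    function pk = pk_at(alpha, beta, t)
    [~, pk] = eval_lanczos_poly(alpha, beta, t^2);   % sketch from Sect. 2
    end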

In Fig. 1 we give an idea of the behavior of the polynomials \(p_k\) and \(q_k\). For \(A=\mathsf{diag(1:100)}\), we carry out 10 steps of Lanczos bidiagonalization with a random starting vector.

Fig. 1

The Lanczos polynomials \(q_{10}(t^2)\) and \(t \, p_{10}(t^2)\) after 10 steps of Lanczos bidiagonalization, with \(\varepsilon = 0.01\). Their largest zeros determine guaranteed lower bounds for \(\Vert A\Vert \). The intersection points with the line \(1/\delta \) determine upper bounds for \(\Vert A\Vert \) with probability at least 99 %. The only difference between the two figures is the scale on the vertical axis

We take \(\varepsilon = 0.01\); it then follows from (8) that \(1/\delta \approx 792\). The largest singular value of \(B_{10}\) is \(\theta _1 \approx 99.83\), while that of \(\widehat{B}_{10}\) is \(\widehat{\theta }_1 \approx 99.86\). Determining the \(t > \theta _1\) for which \(q_{10}(t^2) = 1/\delta \) gives the probabilistic bound \(\sigma _\mathrm{up} \approx 105.87\), which is correct with probability at least 99 %. Likewise, \(t \, p_{10}(t^2) = 1/\delta \) yields \(\sigma _\mathrm{up} \approx 105.35\). We refer to Sect. 5 for many more numerical experiments.

4 Ritz Polynomials

In Sect. 2 we have also introduced, in addition to the “Lanczos” polynomials \(p_k\) and \(q_k\), the “Ritz” polynomials \(r_j= r_j^{(k)}\) and \(s_j = s_j^{(k)}\), for \(j=1,\ldots ,k\); see (5) and (6). These polynomials are associated with the approximate right and left singular vectors \(V_k\mathbf{d}_j\) and \(U_k\widehat{\mathbf{c}}_j\), which are sometimes called Ritz vectors in the context of eigenvalue problems; we will use the same terminology in this situation. We will now exploit the polynomials \(r_1\) and \(s_1\) corresponding to the largest approximate singular vectors (that is, the approximate left and right singular vectors corresponding to the largest approximate singular values \(\widehat{\theta }_1^{(k)}\) and \(\theta _1^{(k)}\), respectively). The following result is similar to Proposition 4.

Proposition 10

The polynomials \(r_1\) and \(s_1\) have positive leading coefficients and increase strictly monotonically to the right of their largest zeros \(\widehat{\theta }_2^2\) and \(\theta _2^2\), respectively.

Proof

This follows from Corollary 3 and the fact that \(r_1\) and \(s_1\) are polynomials of degree \(k-1\). \(\square \)

Recall from (5) that \(V_k\mathbf{d}_1 = s_1(A^T \! A) \, \mathbf{v}_1\) is the approximation to the right singular vector corresponding to the largest singular value \(\theta _1\) of \(B_k\), which is an approximation (more precisely, a lower bound) for \(\Vert A\Vert \). Since

$$\begin{aligned} \theta _1^2 = \Vert AV_k\mathbf{d}_1\Vert ^2 = \sum _{j=1}^n \gamma _j^2 \, \sigma _j^2 \, s_1(\sigma _j^2)^2 \end{aligned}$$

we derive

$$\begin{aligned} \theta _1 \ge |\gamma _1| \, \sigma _1 \, s_1(\sigma _1^2). \end{aligned}$$

Analogously, since from (6) we have \(U_k\widehat{\mathbf{c}}_1 = r_1(AA^T) A\mathbf{v}_1\), we get

$$\begin{aligned} \widehat{\theta }_1^2 = \Vert A^T \! U_k\widehat{\mathbf{c}}_1\Vert ^2 = \sum _{j=1}^n \gamma _j^2 \, \sigma _j^4 \, r_1(\sigma _j^2)^2 \end{aligned}$$

so

$$\begin{aligned} \widehat{\theta }_1 \ge |\gamma _1| \, \sigma _1^2 \, r_1(\sigma _1^2). \end{aligned}$$

The next result follows in the same way as Theorem 9.

Theorem 11

Assume that the starting vector \(\mathbf{v}_1\) has been chosen randomly with respect to the uniform distribution over \(S^{n-1}\) and let \(\varepsilon \in (0,1)\). Then the largest zeros of the polynomials

$$\begin{aligned} f_3(t)&= t \, s_1(t^2) - \theta _1 / \delta \end{aligned}$$
(11)
$$\begin{aligned} f_4(t)&= t^2 \, r_1(t^2) - \widehat{\theta }_1 / \delta \end{aligned}$$
(12)

with \(\delta \) given by (8), are upper bounds for \(\Vert A\Vert \) with probability at least \(1-\varepsilon \).
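
To evaluate \(s_1\) (and hence \(f_3\)) at an arbitrary point, note that \(V_k\mathbf{d}_1 = \sum _{j=1}^k (\mathbf{d}_1)_j \mathbf{v}_j\) and \(\mathbf{v}_j = q_{j-1}(A^T \! A)\, \mathbf{v}_1\) by Proposition 1, so that \(s_1(t) = \sum _{j=1}^k (\mathbf{d}_1)_j \, q_{j-1}(t)\); the same recurrences as before apply. A Matlab-style sketch (illustrative helper; here d1 is the right singular vector of \(B_k\) belonging to \(\theta _1\)):

    function y = f3_at(alpha, beta, d1, theta1, delta, t)
    % Evaluate f_3(t) = t*s_1(t^2) - theta1/delta, cf. (11), using the
    % expansion s_1 = sum_j d1(j) q_{j-1} of V_k d_1 in the Lanczos basis.
    k = length(d1);
    q = zeros(k, 1);  q(1) = 1;                % q_0(t^2) = 1
    p_old = 0;  qq = 1;  b_old = 0;            % p_{-1}, q_0, beta_0
    for j = 1:k-1
        p  = (qq - b_old*p_old)/alpha(j);      % p_{j-1}(t^2)
        qq = (t^2*p - alpha(j)*qq)/beta(j);    % q_j(t^2)
        q(j+1) = qq;
        p_old = p;  b_old = beta(j);
    end
    y = t * (d1(:).' * q) - theta1/delta;
    end

Analogously, \(r_1(t) = \sum _{j=1}^k (\widehat{\mathbf{c}}_1)_j \, p_{j-1}(t)\), since \(U_k\widehat{\mathbf{c}}_1 = \sum _j (\widehat{\mathbf{c}}_1)_j \mathbf{u}_j\) and \(\mathbf{u}_j = p_{j-1}(AA^T)\, A\mathbf{v}_1\); this may be used to evaluate \(f_4\).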

Remark

In [13], Chebyshev polynomials of the first kind were also studied. On a given interval, these polynomials have absolute value at most 1, and they tend to increase sharply outside this interval. Nevertheless, experience in [13] shows that the Lanczos and Ritz polynomials, which are implicitly generated and “adapted” to the problem at hand, naturally tend to give better probabilistic bounds than “fixed” Chebyshev polynomials that only use partial information, such as the approximations \(\theta _1\) and \(\theta _k\) to the largest and smallest singular values, respectively. Therefore, we do not study this type of polynomial in this paper.

5 Numerical Experiments

First, we give a pseudocode for Lanczos bidiagonalization with reorthogonalization and the computation of the probabilistic bounds.

[Pseudocode: Lanczos bidiagonalization with reorthogonalization and computation of the probabilistic bounds]
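
As an illustration only (a sketch assembled from the helper routines introduced earlier, not a line-by-line reproduction of the pseudocode; the line numbers in the remarks below refer to that pseudocode), the overall procedure could look as follows in Matlab:

    function [theta1_hat, sigma_up] = bidiag_norm_bounds(A, k, epsilon)
    % Sketch: k steps of Lanczos bidiagonalization with full reorthogonalization,
    % plus half a step to obtain alpha_{k+1}; returns the guaranteed lower bound
    % theta1_hat = ||Bhat_k|| and the probabilistic upper bound sigma_up based
    % on f_2, cf. (10).  Uses the hypothetical helpers compute_delta and
    % prob_upper_bound sketched in Sect. 3.
    [m, n] = size(A);
    v1 = randn(n, 1);  v1 = v1/norm(v1);             % random start on S^{n-1}
    U = zeros(m, k+1);  V = zeros(n, k+1);  V(:,1) = v1;
    alpha = zeros(k+1, 1);  beta = zeros(k, 1);
    for j = 1:k+1
        u = A*V(:,j);
        if j > 1, u = u - beta(j-1)*U(:,j-1); end
        u = u - U(:,1:j-1)*(U(:,1:j-1)'*u);          % reorthogonalize against U
        alpha(j) = norm(u);  U(:,j) = u/alpha(j);
        if j == k+1, break, end                      % alpha_{k+1} suffices for p_k
        w = A'*U(:,j) - alpha(j)*V(:,j);
        w = w - V(:,1:j)*(V(:,1:j)'*w);              % reorthogonalize against V
        beta(j) = norm(w);  V(:,j+1) = w/beta(j);
    end
    Bhat = [diag(alpha(1:k)), zeros(k,1)] + [zeros(k,1), diag(beta)];
    theta1_hat = norm(Bhat);                         % guaranteed lower bound
    delta = compute_delta(n, epsilon);               % from (8)
    sigma_up = prob_upper_bound(alpha, beta, delta, theta1_hat, norm(A, 'fro'));
    end

A typical call is [lb, ub] = bidiag_norm_bounds(sparse(1:1000, 1:1000, 1:1000), 20, 0.01), which brackets \(\Vert A\Vert = 1000\) for \(A\) = diag(1:1000) by a guaranteed lower bound and a 99 % probabilistic upper bound.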

A few remarks about the algorithm: lines 6 and 12 implement reorthogonalization in a computationally efficient way. (Although reorthogonalization turned out to be unnecessary in the experiments, we still recommend it to ensure stability.) The probabilistic bounds may be computed in line 16, but also, if desired, after line 8 or 13. We propose to use the polynomial \(f_2\) (see (10); the motivation is given in Experiment 1 below). As explained earlier in the paper, breakdowns as well as restarts are not included.

Experiment 1

To get an idea of the behavior of the probabilistic bounds, we first take \(n=1000, A\) = diag(1:1000), \(\varepsilon = 0.01\), and a random \(\mathbf{v}_1\) on \(S^{n-1}\) as explained before Lemma 7; see Fig. 2. Indicated are as a function of the iteration number \(k\):

  • the largest singular values \(\widehat{\theta }_1^{(k)}\) of the bidiagonal \(k \times (k+1)\) matrices \(\widehat{B}_k\), which are guaranteed lower bounds for \(\Vert A\Vert \) (dots);

  • the probabilistic upper bounds based on the polynomials \(f_1\) using the Lanczos polynomials \(q_k\) (see (9), dashed);

  • the probabilistic upper bounds based on the polynomials \(f_2\) using the Lanczos polynomials \(p_k\) (see (10), solid);

  • the probabilistic upper bounds based on the polynomials \(f_3\) using the Ritz polynomials \(s_1^{(k)}\) (see (5) and (11), dash-dotted); and

  • the probabilistic upper bounds based on the polynomials \(f_4\) using the Ritz polynomials \(r_1^{(k)}\) (see (6) and (12), dotted).

Fig. 2

Ritz values \(\widehat{\theta }_1\) and probabilistic upper bounds (\(\varepsilon =0.01\)) for the matrix norm of \(A\) = diag(1:1000)

As may be seen and expected, the Lanczos polynomials \(p_k\) and \(q_k\) (degree \(k\), largest zeros \(\widehat{\theta }_1^2\) and \(\theta _1^2\), respectively) yield better bounds than the Ritz polynomials \(r_1\) and \(s_1\) (degree \(k-1\), largest zeros \(\widehat{\theta }_2^2\) and \(\theta _2^2\), respectively; recall that \(\widehat{\theta }_1 \ge \widehat{\theta }_2\) and \(\theta _1 \ge \theta _2\)). Comparing the two Lanczos polynomials, \(f_2\) with degree \(2k+1\) gives better results than the polynomial \(f_1\) with degree \(2k\); note also that the largest zero \(\widehat{\theta }_1^2\) of \(p_k\) is not smaller than the largest zero \(\theta _1^2\) of \(q_k\).

We see that for rather modest \(k\) we already obtain reasonably sharp guaranteed lower bounds and probabilistic upper bounds. Based on this experience, we will only consider the lower bounds \(\widehat{\theta }_1^{(k)}\) in the following experiments (note that \(\widehat{\theta }_1^{(k)} \ge \theta _1^{(k)}\)), and the probabilistic upper bounds derived from the polynomials \(f_2\), based on the Lanczos polynomials \(p_k\), as these tend to be sharper than those obtained with the other polynomials.

Experiment 2

We now experiment with some common SVD test matrices of relatively small size, available either from the MatrixMarket [12] or from SVDPACK, so that we can compare with the exact \(\Vert A\Vert \). In Table 1 we compare the performance of Matlab’s normest, a power method on \(A^T\!A\) (third column), and Lanczos bidiagonalization (fourth column), where we allow 20 iterations in both cases, that is, 20 matrix-vector products (MVs) with \(A\) and 20 MVs with \(A^T\). As expected, Lanczos bidiagonalization always gives better results, and sometimes much better results. The reason for this is that the estimate of normest is based on \(\Vert A^T\!\mathbf{w}\Vert /\Vert \mathbf{w}\Vert \), where \(\mathbf{w}= (AA^T)^{19}A\mathbf{v}_1\), while Lanczos bidiagonalization maximizes the same norm over all vectors \(\mathbf{w}\) in the Krylov space

$$\begin{aligned} \mathcal K _{20}(AA^T\!, \, A\mathbf{v}_1) {:}{=} \mathop {\text{ span }}(A\mathbf{v}_1, (AA^T)A\mathbf{v}_1, \ldots , (AA^T)^{19}A\mathbf{v}_1). \end{aligned}$$

In addition, we give the error \(\sigma _\mathrm{up}-\Vert A\Vert \), where the probabilistic upper bounds \(\sigma _\mathrm{up}\) for \(\Vert A\Vert \) have been computed using \(f_2\) (see (10)) with \(\varepsilon = 0.01\), i.e., they are correct with probability at least 99 %. We see from Table 1 that the overestimation of the probabilistic upper bounds is always smaller than the underestimation of normest, sometimes even much smaller.

Table 1 For several SVD test matrices: normest: the error \(\Vert A\Vert -\nu \), where \(\nu \) is the approximation obtained with 20 steps of the power method on \(A^T\!A\) as implemented in Matlab’s normest; bidiag: the error \(\Vert A\Vert -\widehat{\theta }_1\), where \(\widehat{\theta }_1\) is the approximation acquired with 20 steps of Lanczos bidiagonalization; bdprob: the error \(\sigma _\mathrm{up}-\Vert A\Vert \), where the probabilistic upper bound \(\sigma _\mathrm{up}\), computed after 20 steps of Lanczos bidiagonalization, is a true upper bound for \(\Vert A\Vert \) with probability at least 99 %.

Experiment 3

Next, we consider the \(11390 \times 1265\) term-by-document matrix hypatia, with 109056 nonzeros. For such a matrix one commonly computes a few of the largest singular triplets. These determine a low-rank approximation of the matrix, and the angles between the search vectors and the columns of the computed low-rank approximation are used for information retrieval; see [2] and references. After 10 steps of Lanczos bidiagonalization applied to this matrix we get \(\widehat{\theta }_1 \approx 342.2469\), while the upper bound with probability at least 99 % is \(\widehat{\theta }_1 + 2.43 \cdot 10^{-5}\), leaving just a small interval for \(\Vert A\Vert \). The upper bound with probability at least 99.9 % is \(\widehat{\theta }_1 + 2.43 \cdot 10^{-4}\); therefore, we may have confidence in the value of \(\widehat{\theta }_1\).

Experiment 4

Finally, we take the \(23560 \times 23560\) matrix af23560 [12], with 460598 nonzeros, arising in computational fluid dynamics. Ten steps of Lanczos bidiagonalization applied to this matrix yield \(\widehat{\theta }_1 \approx 645.7\). The probabilistic upper bound with \(\varepsilon =0.01\) (probability at least 99 %) is \(\sigma _\mathrm{up} \approx 646.8\), while \(\varepsilon = 0.001\) leads to \(\sigma _\mathrm{up} \approx 652.0\). We may therefore conclude that \(\Vert A\Vert \) is in the interval \([\widehat{\theta }_1, \ 1.01 \, \widehat{\theta }_1]\) with probability at least 99.9 %. This small probabilistic interval (the lower bound \(\widehat{\theta }_1\) as well as the probabilistic upper bound) is obtained in about 0.15 seconds on a laptop with a processor speed of about \(7 \cdot 10^9\) flops per second. This clearly indicates the usefulness of the developed probabilistic bounds for large (sparse) matrices.

6 Discussion and Conclusions

We have developed probabilistic upper bounds for the matrix norm. The bounds may be computed efficiently during or after the Lanczos bidiagonalization process. As we have seen from the experiments, Lanczos bidiagonalization with the probabilistic bounds may give very good results and may be superior to the power method on \(A^T\!A\), as for instance implemented in Matlab’s function normest, using the same number of MVs. We have proposed various functions \(f_1, f_2, f_3\), and \(f_4\); for the reasons described in Experiment 1 we advocate the use of \(f_2\) (see (10)).

Multiple runs of the method may also be combined to increase the reliability of the estimates. If \(\mathbf{v}_1\) and \(\widehat{\mathbf{v}}_1\) are two independently chosen random initial vectors leading to probabilistic upper bounds \(\sigma _\mathrm{up}^{(1)}\) and \(\sigma _\mathrm{up}^{(2)}\) with probability at least \(1-\varepsilon \), then \(\max \{ \sigma _\mathrm{up}^{(1)}, \, \sigma _\mathrm{up}^{(2)} \}\) is an upper bound with probability at least \(1-\varepsilon ^2\).

Like many other iterative subspace methods, the proposed method is matrix-free: \(A\) need not be known explicitly, as long as \(A\mathbf{v}\) and \(A^T\!\mathbf{u}\) can be computed for arbitrary vectors \(\mathbf{v}\) and \(\mathbf{u}\) of appropriate sizes.

It would be very desirable to develop probabilistic upper bounds for the condition number \(\kappa (A) = \Vert A\Vert \, \Vert A^{-1}\Vert \) as well. Unfortunately, the polynomials generated in Lanczos bidiagonalization are not useful for this, as they do not increase near the origin; in fact, \(q_k(t^2)\) and \(t \, p_k(t^2)\) are even and odd functions of \(t\), respectively. The Lanczos bidiagonalization process only provides guaranteed upper bounds (\(\widehat{\theta }_k\) or \(\theta _k\)) for \(\sigma _{\min }\). Indeed, finding a lower bound for the smallest singular value is known to be difficult; see, e.g., [3] and references. (Note that the results in [4] are based on expensive matrix factorizations.) In the context of Lanczos bidiagonalization, the best available “probabilistic estimate” for \(\kappa (A)\) might be \(\sigma _\mathrm{up} / \widehat{\theta }_k\), where \(\widehat{\theta }_k\) is the smallest singular value of \(\widehat{B}_k\) and \(\sigma _\mathrm{up}\) is the probabilistic upper bound from \(f_2\). However, since the approximation \(\widehat{\theta }_k \approx \sigma _{\min }\) might be arbitrarily poor, this is not a bound of any type. Indeed, experiments with the matrices of Table 1 sometimes gave disappointing results (such as underestimation by a factor of 1000). Further progress in reliable and inexpensive estimation of the matrix condition number would be very welcome.