
1 Introduction and Motivation from Optimization

Effective nonlinear optimization algorithms require 1st derivative information, \(\nabla f(x)\), while superlinear convergence requires some 2nd derivative approximation [12]. For example, standard Quasi-Newton (QN) methods such as BFGS (complete gradient and an approximate Hessian generated from gradient differences) have superlinear terminal convergence. Limited-Memory (LM) QN methods [11], such as LBFGS, which approximate the Hessian efficiently by storing only the most recent gradient differences, are widely used in large-scale optimization. This article formulates randomized QN-like algorithms which can be used to approximate Hessians (as well as general matrices) at reduced cost.

Alternatively, consider Stochastic Gradient Descent (SGD) [14], which is a common dimension-reduction technique in statistics and machine learning. SGD minimizes the average of cost functions \(f_i:\mathbb {R}^m \rightarrow \mathbb {R}\),

$$\begin{aligned} f(x) = \frac{1}{n} \sum _{i=1}^n f_i(x), \end{aligned}$$

by approximating \(\nabla f \approx s^{-1} \sum _{i=1}^s \nabla f_i\). Here, \(f_i\) is associated with the i-th entry of a large (n-entry) data or training set. The SGD approximation is simply \(\nabla f \approx F\,p\), where F is the matrix with ith column \(\nabla f_i\) and the sparse vector p has s non-zero entries of 1/s at sampled indices. Our algorithms generalize SGD by incorporating flexible sampling in \(\mathbb {R}^n\) along with sampling in the parameter space \(\mathbb {R}^m\), allowing them to be tuned to the available computational hardware.
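
For concreteness, the following MATLAB sketch illustrates the sampled-gradient approximation \(\nabla f \approx F\,p\) described above; the random stand-in for F, the dimensions, and the sample size s are illustrative assumptions, not taken from the article.

```matlab
% Minimal sketch of the SGD approximation grad f ~ F*p described above.
% F stands in for the matrix of per-example gradients at a fixed x; the
% dimensions and sample size s are illustrative choices.
m = 20; n = 1000; s = 50;
F = randn(m, n);                        % column i plays the role of grad f_i(x)
idx = randperm(n, s);                   % s sampled indices
p = zeros(n, 1); p(idx) = 1/s;          % sparse sampling vector: s entries of 1/s
g_sgd  = F*p;                           % stochastic gradient estimate
g_full = mean(F, 2);                    % full gradient (1/n) * sum_i grad f_i(x)
disp(norm(g_sgd - g_full)/norm(g_full)) % relative sampling error
```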

Section 2 introduces the fundamental problem, two-sided samples and terminology. Section 3 reviews randomized one-sided Quasi-Newton algorithms while Sect. 4 develops our randomized, two-sided Quasi-Newton algorithms. Section 5 provides probabilistic convergence rates and error estimates. Section 6 numerically demonstrates the convergence of the algorithms. Section 7 incorporates an inner block power iteration to accelerate our two-sided algorithms and compares the result to one-sided algorithms based on similar heuristics.

2 Fundamental Problem, Samples, and Terminology

The fundamental problem our algorithms address is how to efficiently construct a sequence of approximations to a matrix, \(A\in \mathbb {R}^{m\times n}\), from a stream of incomplete and possibly noisy data. Specifically, we develop and analyze algorithms to iteratively embed aggregate information from

$$\begin{aligned} U^{\top }A\,V \in \mathbb {R}^{s_1 \times s_2}, \quad U \in \mathbb {R}^{m\times s_1}, \quad V \in \mathbb {R}^{n\times s_2}. \end{aligned}$$

These weighted linear combinations of the rows and columns of the data A are called two-sided samples. This is in contrast to weighted linear combinations of the rows, \(U^\top \,A\), or weighted linear combinations of the columns, \(A\,V\), which we refer to as one-sided samples. Two-sided samples have been used before in non-iterative algorithms: [9] compares Schatten-p norm estimates (pth root of the sum of the pth power of the singular values) using two-sided samples, \(U^{\top }A\,V\), (termed a bi-linear sketch) to estimates using one-sided samples, \(A\,V\). Large eigenvalue estimates using two-sided random projectors are examined in [1]; and two-sided samples are used in [2] to tighten bounds on low-rank approximations. We follow their lead by simply counting sample entries to estimate data cost: \(m\, s_2\) for the one-sided samples, \(A\,V\), and \(s_1\, s_2\) for the two-sided sample, \(U^{\top }A\,V\). Algorithms using two-sided samples are a subset of randomized numerical linear algebra. An overview of existing algorithms and applications is provided by the extensive list of articles citing the comprehensive review [7]. The algorithms in [1, 7] expend significant up-front effort computing projections \(\varOmega \) and \(\varPsi \) so that the projected matrix, \(\varOmega ^\top \,A\,\varPsi \), approximates A on dominant eigenvalues with the goal that uniform random sampling of \(\varOmega ^\top \,A\,\varPsi \) yields good approximations to A. In contrast, our algorithms produce an improving sequence of approximations by iteratively embedding small randomized two-sided samples, \(U^\top \,A\,V\), with sample dimensions \(s_1\times s_2\) that can be chosen to suit available hardware. Throughout we compare algorithms using the cost estimates (respectively \(s_1\, s_2\) and \(m\, s_2\) for the samples \(U^\top \,A\,V\) and \(A\,V\)) from [9].

We use notation motivated by QN algorithms in non-linear optimization. SPD means symmetric positive definite and W is an SPD weight matrix. \(X^+\) denotes the Moore-Penrose pseudo-inverse of X; \(\langle X, Y \rangle _F = \mathrm Tr\left[ X^\top Y\right] \) and \(\Vert X\Vert _F^2 = \langle X, X \rangle _F\) are the Frobenius inner product and norm. For conforming SPD weights \(W_1\) and \(W_2\) the weighted Frobenius norm (with special case \(X = X^\top \) and \(W=W_1=W_2\) written \(F(W^{-1})\)) is

$$\begin{aligned} \Vert X\Vert _{F(W_{1}^{-1},W_{2}^{-1})}^2 = \Vert W_{1}^{-1/2}X\,W_{2}^{-1/2}\Vert _{F}^2. \end{aligned}$$

The W-weighted projector \(\mathcal {P}\), which projects onto the column space of \(W\,U\),

$$\begin{aligned} \mathcal {P} = P_{W^{-1},U} = W\,U(U^\top \,W\,U)^{-1}U^{\top }, \end{aligned}$$
(1)

satisfies \(\mathcal {P} \, W = W \, \mathcal {P}^\top = \mathcal {P} \, W \, \mathcal {P}^\top \quad \text{ and } \quad W^{-1}\mathcal {P} = \mathcal {P}^\top W^{-1} = \mathcal {P}^\top \, W^{-1} \, \mathcal {P}\).
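A minimal MATLAB sketch (with an arbitrary SPD weight chosen purely for illustration) that builds the projector in Eq. (1) and numerically checks the identities above:

```matlab
% Minimal check of the weighted projector in Eq. (1) and the identities above;
% the dimensions and the random SPD weight are illustrative choices.
n = 8; s = 3;
W = randn(n); W = W*W' + n*eye(n);      % SPD weight
U = orth(randn(n, s));                  % full column rank sample directions
P = W*U/(U'*W*U)*U';                    % P = W*U*(U'*W*U)^(-1)*U'
disp(norm(P*P - P, 'fro'))              % P is a projector (onto the range of W*U)
disp(norm(P*W - W*P', 'fro'))           % P*W = W*P'
disp(norm(W\P - P'/W, 'fro'))           % W^(-1)*P = P'*W^(-1)
```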

3 Randomized One-Sided Quasi-Newton Algorithms

Our iterative approximations to A using two-sided samples are motivated by the one-sided sampled algorithms in [6] and QN optimization algorithms. Classical QN schemes for SPD matrices A are formulated as constrained minimum change updates for \(B\approx A\) or \(H\approx A^{-1}\) in weighted Frobenius norms [12]: the constraint enforces the new information while the minimum change condition stabilizes the update. The one-sided sampled update algorithms in [5, 6] are given by the KKT [8, 12] equations (with particular choices of weight W) for the quadratic programs

$$\begin{aligned} B_{k+1}&=\mathop {\mathrm {arg\,min}}\limits _{B} \left\{ \frac{1}{2}\Vert B-B_k\Vert _{F(W^{-1})}^2 \mid B\,U_k = A\, U_k \text { and } B=B^\top \right\} , \end{aligned}$$
(2)
$$\begin{aligned} H_{k+1}&=\mathop {\mathrm {arg\,min}}_{H} \left\{ \frac{1}{2}\Vert H-H_k\Vert _{F(W^{-1})}^2 \mid U_k = H\, A\, U_k \text { and } H=H^\top \right\} . \end{aligned}$$
(3)

The analytical updates defined by Eqs. (2) and (3) are

$$\begin{aligned} B_{k+1}&= B_k + \mathcal {P}_B (A - B_k) + (A - B_k)\mathcal {P}_B^\top - \mathcal {P}_B(A - B_k)\mathcal {P}_B^\top , \end{aligned}$$
(4)
$$\begin{aligned} H_{k+1}&= H_k + \mathcal {P}_H (A^{-1} - H_k) + (A^{-1} - H_k)\mathcal {P}_H^\top - \mathcal {P}_H(A^{-1} - H_k)\mathcal {P}_H^\top , \end{aligned}$$
(5)

where the weighted projectors \(\mathcal {P}_B\) and \(\mathcal {P}_H\) defined by Eq. (1) are

$$\begin{aligned} \mathcal {P}_B&= P_{W^{-1},U_k} = W\,U_k(U_k^\top \,W\,U_k)^{-1}U_k^\top , \\ \mathcal {P}_H&= P_{W^{-1},A\,U_k} = W\,A\,U_k(U_k^\top \,A\,W\,A\,U_k)^{-1}U_k^\top \,A. \end{aligned}$$

Note, these are two different updates using the same one-sided sample \(A \, U_k\) which are not simply connected by the Sherman-Morrison-Woodbury (SMW) formula. In Eq. (4), \(B_{k+1}\) is an improved approximation to A while in Eq. (5), \(H_{k+1}\) is an improved approximation to \(A^{-1}\). Familiar algorithms are obtained by selecting different weights, W. Block DFP [15] is Eq. (4) with \(W = A\)

$$\begin{aligned} B_{k+1} = \left( I_n-\mathcal {P}_{\text {DFP}}\right) \,B_k(I_n-\mathcal {P}_{\text {DFP}}^\top ) + \mathcal {P}_{\text {DFP}} \, A, \end{aligned}$$

where \( \mathcal {P}_{\text {DFP}} = P_{A^{-1},U_k} = A \,U_k(U_k^\top \, A \,U_k)^{-1}U_k^\top \). Block BFGS [5, 6] is the result of inverting Eq. (5) with \(W=A^{-1}\) using the SMW formula

$$\begin{aligned} B_{k+1} = B_k - B_k U_k \left( U_k^\top B_k U_k \right) ^{-1} U_k^\top B_k + A \,U_k \left( U_k^\top A U_k\right) ^{-1} U_k^\top \,A. \end{aligned}$$

QN algorithms are commonly initialized with multiples of the identity.
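
A minimal MATLAB sketch of the block BFGS update displayed above, driven by one-sided samples \(A\,U_k\); the SPD test matrix, sample size, and iteration count are illustrative assumptions rather than the paper's test setup.

```matlab
% Minimal sketch of the block BFGS update above using one-sided samples A*U;
% the SPD test matrix, sample size s, and iteration count are illustrative.
n = 200; s = 10; iters = 100;
X = randn(n); A = X*X' + n*eye(n);      % SPD target
B = eye(n);                             % identity initialization, as noted above
for k = 1:iters
    U  = orth(randn(n, s));             % random sample directions
    AU = A*U;                           % one-sided sample, n*s entries
    B  = B - B*U/(U'*B*U)*(U'*B) + AU/(U'*AU)*AU';
end
disp(norm(B - A, 'fro')/norm(A, 'fro')) % residual shrinks with k
```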

4 Randomized Two-Sided Quasi-Newton Algorithms

We develop a general algorithm (defined by SPD weight matrices \(W_1\) and \(W_2\)) to approximate general, possibly non-square, matrices together with two distinct algorithms specialized to symmetric matrices. As with one-sided sampled algorithms, different weights give different algorithms. Algorithms and theorems are developed for a generic initialization \(B_0\).

4.1 General Two-Sided Sampled Update

Analogous to Eq. (2), our first algorithm is defined by the minimization

$$\begin{aligned} \displaystyle B_{k+1} = \mathop {\mathrm {arg\,min}}_{B} \left\{ \frac{1}{2}\Vert B-B_k\Vert _{F(W_{1}^{-1},W_{2}^{-1})}^2 \mid U_k^\top B \,V_k = U_k^\top A\,V_k \right\} . \end{aligned}$$
(6)

Solving the KKT equations for Eq. (6) gives the self-correcting update

$$\begin{aligned} B_{k+1} = B_k + P_{W_1^{-1},U_k} (A - B_k) P^\top _{W_2^{-1},V_k}. \end{aligned}$$
(7)

Since this update explicitly corrects the projected residual sample \( R_k = U_k^\top (A-B_k)\, V_k\), it decreases the weighted Frobenius norm \(\Vert A-B_k\Vert _{F(W_{1}^{-1},W_{2}^{-1})}^2\) unless the approximation is correct on the sampled spaces, i.e., \(U_k^\top (A-B_k)\, V_k = 0\).

Given \(A\in \mathbb {R}^{m\times n}\), initial approximation \(B_0\in \mathbb {R}^{m\times n}\), two-sided sample sizes \(\{s_1, s_2\}\), and SPD weights \(\{W_{1},W_2\}\), Eq. (7) generates a sequence \(\{B_k\}\) that converges monotonically to A in the appropriate weighted Frobenius norm. Pseudocode is provided in Algorithm 1: boxed values give the two-sided sample size per iteration; double-boxed values give the total over all iterations. For symmetric A, the independent left- and right-hand sampling fails to preserve symmetry.

[Algorithm 1 pseudocode]
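
A minimal MATLAB sketch of Algorithm 1, i.e. repeated application of Eq. (7), in the unweighted case \(W_1=W_2=I\); the target matrix, sample sizes, and iteration count are illustrative assumptions rather than the paper's test setup.

```matlab
% Minimal sketch of Algorithm 1, i.e. the update in Eq. (7), with W1 = W2 = I;
% the target, sample sizes, and iteration count are illustrative choices.
m = 300; n = 200; s1 = 15; s2 = 10; iters = 4000;
A = randn(m, n); B = zeros(m, n);           % target and initial approximation
for k = 1:iters
    U = orth(randn(m, s1));                 % left sample directions
    V = orth(randn(n, s2));                 % right sample directions
    S = U'*A*V;                             % two-sided sample: s1*s2 entries
    B = B + U*(S - U'*B*V)*V';              % Eq. (7); projectors are U*U' and V*V'
end
disp(norm(B - A, 'fro')/norm(A, 'fro'))     % decays like (1 - s1*s2/(m*n))^(k/2)
```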

4.2 Symmetric Update

Unsurprisingly, the fully symmetrized general algorithm (\(A=A^\top ,B_0=B_0^\top \), \(V_k=U_k\) and \(W=W_1=W_2\)) gives symmetric approximations. Pseudocode is provided in Algorithm 2 with sample counts boxed as before.

[Algorithm 2 pseudocode]

Remark 1

Algorithm 2 can give non-SPD updates from SPD input, e.g.,

$$\begin{aligned} W=\begin{bmatrix} 1 &{} 0\\ 0 &{} 1 \end{bmatrix}, \quad A= \begin{bmatrix} 1 &{} 0\\ 0 &{} 1 \end{bmatrix}, \quad B = \begin{bmatrix} 1 &{} 0 \\ 0 &{} 9 \end{bmatrix}, \quad \text{ and } \quad U=\frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix}. \end{aligned}$$
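
A direct numerical check of this example, using the symmetric update of Sect. 4.2 with the data above:

```matlab
% Numerical check of Remark 1: one symmetric update (Sect. 4.2) applied to the
% SPD data above yields an indefinite matrix.
W = eye(2); A = eye(2); B = [1 0; 0 9]; U = [1; 1]/sqrt(2);
P = W*U/(U'*W*U)*U';                  % Eq. (1); here simply U*U'
Bnew = B + P*(A - B)*P';              % one step of the symmetric two-sided update
disp(eig(Bnew))                       % approximately -1.47 and 7.47: not SPD
```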

4.3 Multi-step Symmetric Updates

Enforcing symmetry for \(A=A^\top \) and \(B_0 = B_0^\top \) with an internal step

$$\begin{aligned} \begin{aligned} B_{k+1/2}&= B_k + P_{W^{-1},U_k} (A - B_k) P^\top _{W^{-1},V_k}\\ B_{k+1}&= \frac{1}{2}\left( B_{k+1/2} + B_{k+1/2}^\top \right) \end{aligned} \end{aligned}$$

gives convergence comparable to Algorithm 2. However, the two-step algorithm

$$\begin{aligned} \begin{aligned} B_{k+1/3}&= B_k + P_{W_1^{-1},U_k} (A - B_k) P^\top _{W_2^{-1},V_k}\\ B_{k+2/3}&= B_{k+1/3} + P_{W_2^{-1},V_k} (A - B_{k+1/3}^\top ) P^\top _{W_1^{-1},U_k}\\ B_{k+1}&= \frac{1}{2}\left( B_{k+2/3} + B_{k+2/3}^\top \right) , \end{aligned} \end{aligned}$$
(8)

has superior convergence properties and requires no additional data since

$$\begin{aligned} P_{W_2^{-1},V_k} A\, P^\top _{W_1^{-1},U_k} = \left( P_{W_1^{-1},U_k} A\, P^\top _{W_2^{-1},V_k}\right) ^\top . \end{aligned}$$

Pseudocode is provided in Algorithm 3 with sample counts boxed as before.

[Algorithm 3 pseudocode]
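
The following MATLAB sketch realizes the two-step update of Eq. (8) in the unweighted case \(W_1=W_2=I\); the symmetric test matrix, sample size, and iteration count are illustrative assumptions.

```matlab
% Minimal sketch of the two-step symmetric update in Eq. (8) with W1 = W2 = I;
% the symmetric test matrix, sample size, and iteration count are illustrative.
n = 200; s = 14; iters = 800;
X = randn(n); A = X*X';                     % symmetric target
B = zeros(n);                               % symmetric initial guess
for k = 1:iters
    U = orth(randn(n, s)); V = orth(randn(n, s));
    S = U'*A*V;                             % the only sampled data; reused as S'
    B = B + U*(S - U'*B*V)*V';              % B_{k+1/3}
    B = B + V*(S' - V'*B'*U)*U';            % B_{k+2/3} reuses S'
    B = (B + B')/2;                         % symmetrize: B_{k+1}
end
disp(norm(B - A, 'fro')/norm(A, 'fro'))
```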

5 Convergence Analysis

Our convergence results rely on properties of randomly generated projectors. In our experiments, we orthogonalize square matrices with entries drawn from N(0, 1) to generate rotations from a rotationally invariant distribution [16]. Our algorithms use symmetric rank s projectors defined by an SPD weight W

$$\begin{aligned} \hat{z} = W^{1/2}U(U^\top W U)^{-1}U^\top \,W^{1/2}, \end{aligned}$$

where U is simply the first s columns of a random rotation. The smallest and largest eigenvalues \(\lambda _1\) and \(\lambda _n\) of the expectation \(\mathbf{E} [\hat{z}]\) of these random projections determine the convergence of our algorithms, with optimal rates when \(\lambda _1=\lambda _n\).

Definition 1

A random matrix, \(\hat{X} \in \mathbb {R}^{m \times n}\), is rotationally invariant if the distribution of \(Q_m\,\hat{X}\,Q_n\) is the same as that of \(\hat{X}\) for all rotations \(Q_i \in \mathcal {O}(i)\).

Proposition 1

Let \(\mathcal {Z}\) be any distribution of real, rank s projectors in \(\mathbb {R}^n\). Then,

$$\begin{aligned} 0 \le \lambda _{\min }(\mathbf{E} [\hat{z}]) \le \frac{s}{n}\le \lambda _{\max }(\mathbf{E} [\hat{z}]) \le 1, \quad \hat{z}\in \mathcal {Z}. \end{aligned}$$

Further, if \(\hat{z}\) is rotationally invariant, then \(\mathbf{E} [\hat{z}] = \frac{s}{n}I_n\).
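
A quick Monte-Carlo check of the rotationally invariant case of Proposition 1, with \(W=I\) so that \(\hat{z}=U\,U^\top \); the dimensions and trial count are illustrative assumptions.

```matlab
% Monte-Carlo check of Proposition 1 in the rotationally invariant case with
% W = I, where z_hat = U*U'; dimensions and the trial count are illustrative.
n = 40; s = 6; trials = 20000;
Ez = zeros(n);
for t = 1:trials
    U = orth(randn(n, s));             % orthonormal basis of a uniformly random s-dim subspace
    Ez = Ez + U*U';                    % z_hat = U*(U'*U)^(-1)*U' = U*U'
end
Ez = Ez/trials;
disp(norm(Ez - (s/n)*eye(n), 'fro'))   % tends to 0 as trials grow
```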

Proposition 2

For \(R \in \mathbb {R}^{m \times n}\) and conforming symmetric projections \(\hat{y},\hat{z}\),

$$\begin{aligned} \begin{aligned} \langle R \, \hat{z}, R \, \hat{z} \rangle _F&= \langle R, R \, \hat{z} \rangle _F\\ \langle \hat{y} \, R \, \hat{z}, \hat{y} \, R \, \hat{z} \rangle _F&= \langle \hat{y} \, R \, \hat{z}, R \, \hat{z} \rangle _F = \langle \hat{y} \, R \, \hat{z}, R \rangle _F \end{aligned} \end{aligned}$$
(9)

Proposition 3

For any \(R \in \mathbb {R}^{m \times n}\) and conforming symmetric positive semi-definite matrices \(S_1\), \(S_2\), and (in the special case \(m=n\) ) S we have the bounds:

$$\begin{aligned} \lambda _{\min }(S_1) \langle R, R \rangle _F&\le \langle S_1 \, R, R \rangle _F \le \lambda _{\max }(S_1) \langle R, R \rangle _F, \\ \lambda _{\min }(S_2) \langle R, R \rangle _F&\le \langle R, R\, S_2 \rangle _F \le \lambda _{\max }(S_2)\langle R, R \rangle _F,\\ \lambda _{\min }(S)^2 \langle R, R \rangle _F&\le \langle S \, R, R\, S \rangle _F \le \lambda _{\max }(S)^2\langle R, R \rangle _F. \end{aligned}$$

Remark 2

Convergence results for Algorithms 1 to 3 are for \(\mathbf{E} [\Vert B-A\Vert _F^2]\). Such results dominate similar results for \(\Vert \mathbf{E} [B-A]\Vert _F^2\) since

$$\begin{aligned} \left\| \mathbf{E} \left[ B-A\right] \right\| _F^2 = \mathbf{E} \left[ \left\| B-A\right\| _F^2\right] - \mathbf{E} \left[ \left\| B-\mathbf{E} \left[ B\right] \right\| _F^2\right] . \end{aligned}$$

Theorem 1

(Convergence of Algorithm 1 - NS). Let \(A \in \mathbb {R}^{m \times n}\) and \(B_0 \in \mathbb {R}^{m \times n}\) be given, and let \(W_1 \in \mathbb {R}^{m \times m}\) and \(W_2 \in \mathbb {R}^{n \times n}\) be fixed SPD weights. If \(U_{k} \in \mathbb {R}^{m \times s_1}\) and \(V_{k} \in \mathbb {R}^{n \times s_2}\) are random, independently selected orthogonal matrices with full column rank (with probability one), then \(B_{k}\) from Algorithm 1 satisfies

$$\begin{aligned} \mathbf{E} \left[ \left\| B_{k+1}-A\right\| _{F(W_{1}^{-1},W_{2}^{-1})}^2\right] \le (\rho _{\text {NS}})^k\mathbf{E} \left[ \Vert B_{0}-A\Vert _{F(W_{1}^{-1},W_{2}^{-1})}^2\right] , \end{aligned}$$

where \(\rho _{\text {NS}} = 1-\lambda _{\min }(\mathbf{E} [\hat{y}])\, \lambda _{\min }(\mathbf{E} [\hat{z}])\), with

$$\begin{aligned} \hat{y}_{k} = W_{1}^{1/2}U_{k}(U_{k}^\top \,W_{1}U_{k})^{-1}U_{k}^\top \,W_{1}^{1/2},\quad \hat{z}_{k} = W_{2}^{1/2}V_{k}(V_{k}^\top \,W_{2}V_{k})^{-1}V_{k}^\top \,W_{2}^{1/2}. \end{aligned}$$

Proof

Define the kth residual as \(R_k:=W_{1}^{-1/2}(B_k-A)W_{2}^{-1/2}\). With some algebraic manipulation, Eq. (7) can be re-written as \(R_{k+1} = R_k - \hat{y}_{k}R_{k}\hat{z}_{k}\). Computing the squared Frobenius norm of both sides,

$$\begin{aligned}&\langle R_{k+1}, R_{k+1} \rangle _F = \langle R_{k}-\hat{y}_{k}R_{k}\hat{z}_{k}, R_{k}-\hat{y}_{k}R_{k}\hat{z}_{k} \rangle _F \\&= \langle R_{k}, R_{k} \rangle _F - \langle R_{k}, \hat{y}_{k}R_{k}\hat{z}_{k} \rangle _F - \langle \hat{y}_{k}R_{k}\hat{z}_{k}, R_{k} \rangle _F + \langle \hat{y}_{k}R_{k}\hat{z}_{k}, \hat{y}_{k}R_{k}\hat{z}_{k} \rangle _F \\&= \langle R_{k}, R_{k} \rangle _F - \langle \hat{y}_{k}R_{k}\hat{z}_{k}, R_{k}\hat{z}_{k} \rangle _F, \end{aligned}$$

where we have made use of Proposition 2. Taking the expected value with respect to independent samples \(U_{k}\) (leaving \(V_{k}\) and \(R_k\) fixed) gives

$$\begin{aligned} \begin{aligned}&\mathbf{E} \left[ \left\| R_{k+1}\right\| _F^2 \mid V_{k},R_k \right] = \langle R_k, R_k \rangle _F - \langle \mathbf{E} [\hat{y}_{k}] R_{k}\hat{z}_{k}, R_{k}\hat{z}_{k} \rangle _F \\&\le \langle R_k, R_k \rangle _F - \lambda _{\min }(\mathbf{E} [\hat{y}_{k}]) \langle R_{k}\hat{z}_{k}, R_{k}\hat{z}_{k} \rangle _F = \langle R_k, R_k \rangle _F - \lambda _{\min }(\mathbf{E} [\hat{y}_{k}]) \langle R_{k}, R_{k}\hat{z}_{k} \rangle _F, \end{aligned} \end{aligned}$$

where we applied Proposition 3 to the symmetric positive semi-definite matrix \(\mathbf{E} [\hat{y}_{k}]\), and used Eq. (9). Taking the expected value with respect to independent samples \(V_{k}\) and leaving \(R_k\) fixed gives

$$\begin{aligned} \mathbf{E} [\Vert R_{k+1} \Vert _F^2 \mid R_k]&\le \langle R_k, R_k \rangle _F - \lambda _{\min }(\mathbf{E} [\hat{y}_{k}]) \langle R_{k}, R_{k}{} \mathbf{E} [\hat{z}_{k}] \rangle _F \\&\le \langle R_k, R_k \rangle _F - \lambda _{\min }(\mathbf{E} [\hat{y}_{k}]) \lambda _{\min }(\mathbf{E} [\hat{z}_{k}])\langle R_{k}, R_{k} \rangle _F. \end{aligned}$$

Taking the full expectation and noting \(\mathbf{E} [\Vert R_{k+1}\Vert _F^2] = \mathbf{E} [\Vert B_{k+1} - A\Vert _{F(W_1^{-1},W_2^{-1})}^2]\),

$$\begin{aligned} \mathbf{E} [\Vert R_{k+1}\Vert _F^2]&\le \mathbf{E} \left[ \langle R_k, R_k \rangle _F\right] - \lambda _{\min }(\mathbf{E} [\hat{y}_{k}])\lambda _{\min }(\mathbf{E} [\hat{z}_{k}]) \mathbf{E} \left[ \langle R_{k}, R_{k} \rangle _F\right] \\&= (1- \lambda _{\min }(\mathbf{E} [\hat{y}_{k}]) \lambda _{\min }(\mathbf{E} [\hat{z}_{k}]))\mathbf{E} [\langle R_{k}, R_{k} \rangle _F] \end{aligned}$$

gives the result by unrolling the recurrence. Note, independence of \(U_{k}\) and \(V_{k}\) justifies \(\mathbf{E} [\langle \hat{y}_{k}R_{k}\hat{z}_{k}, R_{k}\hat{z}_{k} \rangle _F] = \langle \mathbf{E} [\hat{y}_{k}]R_{k}\hat{z}_{k}, R_{k}\hat{z}_{k} \rangle _F\).

Theorem 2

(Convergence of Algorithm 2 - SS1). Let \(A, W \in \mathbb {R}^{n \times n}\) be fixed SPD matrices and \(U_k \in \mathbb {R}^{n\times s}\) be a randomly selected matrix having full column rank with probability 1. If \(B_0 \in \mathbb {R}^{n \times n}\) is an initial guess for A with \(B_0 = B_0^\top \), then \(B_{k}\) from Algorithm 2 satisfies

$$\begin{aligned} \mathbf{E} [\Vert B_{k+1}-A\Vert _{F(W^{-1})}^2] \le (\rho _{\text {SS1}})^k \mathbf{E} [\Vert B_{0}-A\Vert _{F(W^{-1})}^2], \end{aligned}$$

where \(\rho _{\text {SS1}} = 1-\lambda _{\min }(\mathbf{E} [\hat{z}])^2\) and \(\hat{z}_{k} = W^{1/2}U_{k}(U_{k}^\top \,W \, U_{k})^{-1}U_{k}^\top W^{1/2}\).

Proof

Following steps similar to those in the proof of Theorem 1, we arrive at

$$\begin{aligned} \langle R_{k+1}, R_{k+1} \rangle _F = \langle R_{k}, R_{k} \rangle _F - \langle R_k, \hat{z}_{k}R_{k}\hat{z}_{k} \rangle _F. \end{aligned}$$

Taking the expected value with respect to \(U_k\), leaving \(R_k\) fixed, we have

$$\begin{aligned}&\mathbf {E}\left[ \Vert R_{k+1}\Vert _{F}^2 \mid R_k \right] = \langle R_k, R_k \rangle _F - \mathbf {E}\left[ \langle R_k, \hat{z}_{k}R_k\hat{z}_{k} \rangle _F\right] \\&= \langle R_k, R_k \rangle _F - \mathbf {E}\left[ \mathrm Tr[R_k^\top \hat{z}_{k} R_k \hat{z}_{k}]\right] \\&= \langle R_k, R_k \rangle _F - \mathrm Tr\left[ \mathbf {E}\left[ R_k \hat{z}_{k} R_k \hat{z}_{k}\right] \right] \le \langle R_k, R_k \rangle _F - \mathrm Tr\left[ \mathbf {E}\left[ R_k \hat{z}_{k}\right] ^2\right] , \end{aligned}$$

where the inequality arises from application of Jensen’s Inequality. Simplifying and applying Proposition 3,

$$\begin{aligned}&\mathbf {E}[\Vert R_{k+1}\Vert _{F}^2 \mid R_k] \le \langle R_k, R_k \rangle _F - \mathrm Tr\left[ \mathbf {E}\left[ R_k \hat{z}_{k}\right] ^2\right] \\&= \langle R_k, R_k \rangle _F - \mathrm Tr\left[ R_k\mathbf {E}\left[ \hat{z}_{k}\right] R_k \mathbf {E}\left[ \hat{z}_{k}\right] \right] \\&= \langle R_k, R_k \rangle _F - \langle \mathbf {E}[\hat{z}_{k}]R_k, R_k\mathbf {E}[\hat{z}_{k}] \rangle _F \le \langle R_k, R_k \rangle _F - \lambda _{\min }(\mathbf {E}[\hat{z}_{k}])^2 \langle R_k, R_k \rangle _F. \end{aligned}$$

Taking the full expectation and un-rolling the recurrence yields the desired result.

Theorem 3

(Convergence of Algorithm 3 - SS2). Let \(A,U_k,V_k\) and \(B_0\) be defined as in Theorem 1, and let W be a fixed SPD matrix. Then \(B_{k}\) from Algorithm 3 satisfies

$$\begin{aligned} \mathbf{E} \left[ \left\| B_{k}-A\right\| _{F(W^{-1})}^2\right] \le (\rho _{\text {SS2}})^k\mathbf{E} \left[ \Vert B_{0}-A\Vert _{F(W^{-1})}^2\right] , \end{aligned}$$

where

$$\begin{aligned} \rho _{\text {SS2}} = 1-2\lambda _{\min }(\mathbf{E} [\hat{y}]) \lambda _{\min }(\mathbf{E} [\hat{z}]) + \lambda _{\min }(\mathbf{E} [\hat{y}])^2 \lambda _{\min }(\mathbf{E} [\hat{z}])^2. \end{aligned}$$

Proof

Let \(R_k\) be the kth residual and let \(\hat{y}_{k}, \hat{z}_{k}\) be projectors as in Theorem 1 with \(W = W_1 = W_2\). Eq. (8) can be re-written in terms of \(R_k\) as follows.

$$\begin{aligned} R_{k+1/3} = R_{k} - \hat{y}_{k}R_{k}\hat{z}_{k}, \quad R^\top _{k+2/3} = R^\top _{k+1/3} - \hat{z}_{k}R_{k+1/3}^\top \hat{y}_{k}, \\ R_{k+1} = \frac{1}{2}\left( R_{k+2/3} + R^\top _{k+2/3} \right) . \end{aligned}$$

Theorem 1 gives

$$\begin{aligned} \mathbf{E} \left[ \left\| R_{k+1/3}\right\| _{F}^2\right] \le (\rho _{\text {NS}})\mathbf{E} \left[ \Vert R_k\Vert _{F}^2\right] , \end{aligned}$$

and a repeated application of Theorem 1 gives

$$\begin{aligned} \mathbf{E} \left[ \left\| R_{k+2/3}\right\| _{F}^2\right] \le (\rho _{\text {NS}})\mathbf{E} \left[ \Vert R_{k+1/3}\Vert _{F}^2\right] \le (\rho _{\text {NS}})^2\mathbf{E} \left[ \Vert R_{k}\Vert _{F}^2\right] . \end{aligned}$$

Lastly, using the triangle inequality and the convexity of the squared norm, we observe that

$$\begin{aligned}&\mathbf{E} \left[ \left\| R_{k+1}\right\| _{F}^2\right] = \mathbf{E} \left[ \left\| \frac{1}{2}\left( R_{k+2/3} + R_{k+2/3}^{\top } \right) \right\| _{F}^2\right] \\&\le \frac{1}{2} \mathbf{E} \left[ \left\| R_{k+2/3}\right\| _{F}^2\right] + \frac{1}{2}\mathbf{E} \left[ \left\| R_{k+2/3}^\top \right\| _{F}^2\right] \le (\rho _{\text {NS}})^2\mathbf{E} \left[ \Vert R_k\Vert _{F}^2\right] . \end{aligned}$$

Unrolling the recurrence for k iterations gives the desired result.

With the relevant rate \(\rho \) taken from Eq. (11) below, the error bounds for Algorithms 1 to 3 are

$$\begin{aligned} \Vert R_{k+1}\Vert _{F(W_{1}^{-1},W_{2}^{-1})}^2 \le \rho \Vert R_k\Vert _{F(W_{1}^{-1},W_{2}^{-1})}^2 \end{aligned}$$
(10)

where \(y_1 = \lambda _{\min }(\mathbf{E} [\hat{y}])\), \(z_1 = \lambda _{\min }(\mathbf{E} [\hat{z}])\), and

$$\begin{aligned} \rho _{\text {NS}}(y_1,z_1)&= 1 - y_1 z_1, \quad \rho _{\text {SS1}}(z_1) = 1- z_1^2, \quad \rho _{\text {SS2}}(y_1,z_1) = (1- y_1 z_1)^2. \end{aligned}$$
(11)

Since any symmetric rank s random projection \(\hat{z}\) on \(\mathbb {R}^n\) satisfies \(0\le z_1 \le \frac{s}{n} \le z_n \le 1 \) and rotationally invariant distributions, e.g. \(U U^+\) with \(U\sim N(0,1)^{n \times s}\), further satisfy \(\mathbf{E} [\hat{z}] = \frac{s}{n}I_n\), minimizing the various convergence rates \(\rho \) over the appropriate domains gives the following optimal rates.

Corollary 1

The optimal convergence rates for Algorithms 1 to 3 are attained for \(U_k\) and \(V_k\) sampled from rotationally invariant distributions,

$$\begin{aligned} {\begin{matrix} \rho _{\text {NS}}^{\text {opt}} = 1 - \frac{s_1}{m} \frac{s_2}{n}, \quad \rho _{\text {SS1}}^{\text {opt}} = 1- \left( \frac{s_2}{n}\right) ^2, \quad \rho _{\text {SS2}}^{\text {opt}} = \left( 1- \frac{s_1}{m} \frac{s_2}{n} \right) ^2. \end{matrix}} \end{aligned}$$
(12)
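
The optimal rate \(\rho _{\text {NS}}^{\text {opt}}\) can be checked numerically; the sketch below (unweighted Algorithm 1 with illustrative dimensions, sample sizes, and trial counts) compares the observed per-iteration contraction of the mean squared error to Eq. (12).

```matlab
% Sketch comparing the observed per-iteration contraction of E||B_k - A||_F^2
% for Algorithm 1 (W1 = W2 = I) against rho_NS^opt from Eq. (12).
m = 120; n = 100; s1 = 12; s2 = 10; iters = 400; trials = 20;
A = randn(m, n); rho = 1 - (s1/m)*(s2/n);
err = zeros(iters + 1, 1);
for t = 1:trials
    B = zeros(m, n); err(1) = err(1) + norm(B - A, 'fro')^2;
    for k = 1:iters
        U = orth(randn(m, s1)); V = orth(randn(n, s2));
        B = B + U*(U'*(A - B)*V)*V';                  % Eq. (7)
        err(k + 1) = err(k + 1) + norm(B - A, 'fro')^2;
    end
end
err = err/trials;
disp([mean(err(2:end)./err(1:end - 1)), rho])         % empirical vs theoretical
```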

Remark 3

Theorems 1 to 3 all assume the weight matrix W and distributions are fixed. All our non-accelerated numerical experiments use fixed weights and sample from fixed rotationally invariant distributions.

Remark 4

Corollary 1 is an extremely strong result. Consider for simplicity \(s_1=s_2=s\). Although the convergence rates are \(\sim 1-\left( \frac{s}{n}\right) ^2\), only \(s\times s\) aggregated pieces of information are used each iteration. If a one-sided sampled algorithm uses \(s \times n\) pieces of information per iteration, e.g. [6], our algorithm can take \(\frac{n}{s}\) iterations with the same amount of information. Consequently the error decrease after \(\frac{n}{s}\) iterations,

$$\begin{aligned} \left( 1-\frac{s^2}{n^2}\right) ^{n/s} \approx 1- \frac{n}{s}\cdot \frac{s^2}{n^2} = 1 - \frac{s}{n}, \end{aligned}$$

is comparable to the convergence rates of one-sided sampled QN methods.

Lower bounds on the convergence rates (analogous to the upper bounds in Theorems 1 to 3 but using the upper bounds in Proposition 3) are easily derived. For example, the two-sided error bound for Algorithm 1 is

$$\begin{aligned} \rho _{\text {NS}}(y_m, z_n)\mathbf{E} [\Vert R_{k}\Vert _F^2] \le \mathbf{E} [\Vert R_{k+1}\Vert _{F}^2] \le \rho _{\text {NS}}(y_1,z_1)\mathbf{E} [\Vert R_{k}\Vert _{F}^2], \end{aligned}$$

where, as before, \(y_1 \le y_2 \le \cdots \le y_m\) is the spectrum of \(\mathbf{E} [\hat{y}]\), \(z_1 \le z_2 \le \cdots \le z_n\) is the spectrum of \(\mathbf{E} [\hat{z}]\), and the explicit form for \(\rho _{\text {NS}}\) is in Eq. (11). We collect the similar results for Algorithms 1 to 3 in Corollary 2.

Corollary 2

(Two-Sided Convergence Rates). Given the assumptions of Theorems 1 to 3, the explicit formulas in Eq. (11) for \(\rho \) give the two-sided bounds

$$\begin{aligned} \rho _{\text {NS}}(y_m,z_n)^k&\le \frac{\mathbf{E }\left[ \left\| B_{k+1}-A\right\| ^2_{F(W_{1}^{-1},W_{2}^{-1})}\right] }{\Vert B_{0}-A\Vert ^2_{F(W_{1}^{-1},W_{2}^{-1})}} \le \rho _{\text {NS}}(y_1,z_1)^k \\ \rho _{\text {SS1}}(z_n)^k&\le \frac{\mathbf{E }\left[ \left\| B_{k+1}-A\right\| ^2_{F(W^{-1})}\right] }{\Vert B_{0}-A\Vert ^2_{F(W^{-1})}} \le \rho _{\text {SS1}}(z_1)^k \\ \rho _{\text {SS2}}(y_n,z_n)^k&\le \frac{\mathbf{E }\left[ \left\| B_{k+1}-A\right\| ^2_{F(W^{-1})}\right] }{\Vert B_{0}-A\Vert ^2_{F(W^{-1})}} \le \rho _{\text {SS2}}(y_1,z_1)^k \end{aligned}$$

where \(y_1,y_m, z_1, z_n\) are the extreme eigenvalues of \(\mathbf{E} [\hat{y}]\) and \(\mathbf{E} [\hat{z}]\).

Remark 5

If \(\hat{y}\) and \(\hat{z}\) are rotationally invariant, the upper and lower probabilistic bounds in Corollary 2 coincide since \(z_1=z_n = \frac{s_2}{n}\) and \(y_1=y_m = \frac{s_1}{m}\). Algorithms 1 to 3 all use rotationally invariant distributions and converge predictably at the expected rate. The algorithms still converge with other distributions provided the smallest eigenvalue of the expectation is positive.

6 Numerical Results

Algorithms 1 to 3 were implemented in the MATLAB framework from [6] and tested on representative SPD matrices from the same article: \(A=XX^\top \) with \(X \sim \mathcal {N}(0,1)^{n \times n}\); the Gisette-Scale ridge regression matrix from [3]; and the NASA matrix from [4]. The author's website [13] contains MATLAB scripts and similar results for all matrices from [6]. Computations were performed on Superior, the HPC facility at Michigan Technological University.

Many metrics can be used to objectively compare algorithmic costs. Common metrics include the number of FLOPs, total memory used, communication overhead, and, for matrix-free black-box procedures, the number of individual matrix-vector products \(A\, v\). As noted in [9], the analogous metric for a black-box procedure to compute the matrix product required for our \(s_1 \times s_2\) two-sided sample \(U^\top A \, V\) is the number of sampled entries, \(s_1 \, s_2\). As an explicit example, for \(f:\mathbb {R}^m\rightarrow \mathbb {R}\), Forward-Forward mode [10] Algorithmic Differentiation (AD) simultaneously computes f(x) and a directional 2nd derivative \(u^\top \nabla ^2f(x)\,v\). With sufficient shared-memory processors, AD can efficiently compute f(x) and the two-sided sample \(U^\top \nabla ^2f(x)\, V\) of the Hessian with cost \(s_1 \, s_2\).
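
As an illustration of what a two-sided Hessian sample contains (not of the AD cost model, since full gradients are formed), the sketch below approximates \(U^\top \nabla ^2f(x)\,V\) by central differences of the gradient; the quadratic test function, step size, and sizes are illustrative assumptions.

```matlab
% Illustration of a two-sided Hessian sample U'*H*V, approximated here by
% central differences of the gradient as a stand-in for Forward-Forward AD;
% the quadratic test function, step size h, and sizes are illustrative.
m = 50; s1 = 4; s2 = 3; h = 1e-5;
Q = randn(m); H = Q*Q' + m*eye(m);              % Hessian of f(x) = 0.5*x'*H*x
gradf = @(x) H*x;                               % gradient of the test function
x = randn(m, 1);
U = orth(randn(m, s1)); V = orth(randn(m, s2));
S = zeros(s1, s2);                              % the s1-by-s2 two-sided sample
for j = 1:s2
    dg = (gradf(x + h*V(:, j)) - gradf(x - h*V(:, j)))/(2*h);   % ~ H*V(:, j)
    S(:, j) = U'*dg;
end
disp(norm(S - U'*H*V, 'fro'))                   % small: the sample is accurate
```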

The algorithms in [6] are for symmetric matrices. We compare the convergence of the unweighted (\(W=I\)) Algorithms 1 to 3 (sample size \(s=\lceil \sqrt{n}\rceil \) matching [6]) to the one-sided algorithms in [6]: Fig. 1a for \(A = X\,X^\top \) with \(X \sim \mathcal {N}(0,1)^{5000\times 5000}\); Fig. 1b for Gisette-Scale; and Fig. 1c for NASA. Our algorithms achieve the theoretical convergence rates from Eq. (12) (dotted lines). The weighted algorithms DFP and BFGS use \(B_0=I\); the unweighted Algorithms 1 to 3 use \(B_0=0\). Runs were terminated after \(5n^2\) iterations or when the relative residual norm fell below 0.01. The one-sided algorithms DFP and BFGS have target-dependent weight matrices: DFP is Eq. (4) with weight \(W=A\) while BFGS is the Sherman-Morrison-Woodbury inversion of Eq. (5) with \(W=A^{-1}\). Figure 1a shows our algorithms outperforming both BFGS and DFP for small tolerances. Figure 1b shows enhanced initial convergence for DFP and BFGS, but Fig. 1c demonstrates that BFGS may not converge. In contrast, our two-sided algorithms converge consistently, achieving the theoretical convergence rates.

Fig. 1. Two-sided sampled algorithm performance with theoretical rate as dots. BFGS and DFP can be: a) comparable; b) superior; or c) stall/diverge.

7 Heuristic Accelerated Schemes

Algorithm 2 corrects the symmetric projected residual \(U_k^\top (A-B_k)U_k\) at each stage; significant corrections occur if \(U_k\) aligns with the eigenvectors associated with the large eigenvalues of the residual \(A-B_k\). Block power iteration is a standard heuristic [7] to enhance this alignment.

Incorporating p steps of a block power iteration to enrich \(U_k\) produces the hybrid algorithm given in Algorithm 4: the loop from line 4 to line 9 enriches a random U by multiplying by the residual and re-orthogonalizing p times. As before, work estimates are boxed on the right (p block power iterations require \(p\,n\,s\) and the square symmetric sample requires \(s^2\)) with the total double boxed. Although the inner iteration requires significantly more matrix samples per iteration, conventional wisdom [7] suggests one or two inner iterations are likely to be beneficial. Our experiments show Algorithm 4 is competitive for \(p=2\).

[Algorithm 4 pseudocode]
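
The following MATLAB sketch is consistent with the description above, though it is not the paper's exact pseudocode for Algorithm 4; the test matrix, s, p, and iteration count are illustrative assumptions.

```matlab
% Sketch consistent with the description of Algorithm 4 (SS1A): p block power
% iterations against the residual enrich U before the s-by-s symmetric update;
% the test matrix, s, p, and iteration count are illustrative choices.
n = 200; s = 14; p = 2; iters = 300;
X = randn(n); A = X*X';                 % symmetric target
B = zeros(n);
for k = 1:iters
    U = orth(randn(n, s));
    for j = 1:p                         % enrichment, ~ p*n*s sampled entries
        U = orth((A - B)*U);            % multiply by the residual, re-orthogonalize
    end
    S = U'*A*U;                         % s-by-s symmetric two-sided sample
    B = B + U*(S - U'*B*U)*U';          % symmetric update as in Algorithm 2
    B = (B + B')/2;                     % guard against round-off asymmetry
end
disp(norm(B - A, 'fro')/norm(A, 'fro'))
```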

Acceleration Convergence Results

Algorithm 4 (rotationally invariant samples with \(p=2\)) is compared to BFGS-A (the result of applying the SMW formula to the accelerated method AdaRBFGS from [6]) and the three one-sided, non-accelerated algorithms: S1 and DFP defined by Eq. (4) with weights \(W=I\) and \(W = A\) (respectively), and BFGS defined by applying the SMW formula to Eq. (5) with weight \(W=A^{-1}\). We use the same test matrices, initialization, and termination conditions described in Sect. 6: Fig. 2a shows results for \(A=X X^\top \); Fig. 2b shows results for the Gisette-Scale Hessian matrix [3]; and Fig. 2c shows results for NASA4704 [4]. SS1A matches or outperforms all other algorithms for the three matrices. As before, [13] contains MATLAB scripts and results for all matrices from [6]. Both BFGS-A and SS1A adaptively sample their update spaces: BFGS-A samples from the columns of the Cholesky decomposition of \(B_k\) while SS1A effectively samples from a small power of the residual \(A-B_k\). Comparing BFGS to BFGS-A and S1 to SS1A shows the benefits of adaptivity. The hybrid SS1A performs consistently well.

Fig. 2. Accelerated SS1A algorithm outperforms accelerated BFGS algorithms.

8 Conclusions and Future Work

Algorithms 1 to 3 iteratively generate matrix approximations to a fixed target from two-sided samples. Rotationally invariant sampling gives optimal theoretical convergence in general, and the predicted convergence rates are experimentally verified for several real-world matrices, with performance comparable to existing one-sided algorithms. A hybrid method combining simultaneous iteration (to enrich a subspace) with the two-sided sampled update is developed and shown to be competitive with existing one-sided accelerated schemes.

The algorithms systematically make minimal changes and drive weighted residual norms for a fixed A monotonically to zero. Such self-correcting algorithms can potentially approximate slowly changing matrices, A(x). For example, QN optimization algorithms have a slowly changing Hessian target \(\nabla _x^2 f(x_k)\) while solvers for stiff ODEs \(y'(x) = f(y(x))\) have a slowly changing Jacobian target \(\nabla _y f(y(x_k))\). The two-sided sampled matrix approximation algorithms and theory presented in the article provide a general foundation for these and other applications. Efficient factorized updates, compact low rank approximations, inverse approximation, and sparse matrix sampling are all planned.