1 Introduction

We consider solving large-scale systems of linear equations represented by a matrix \(A\in {\mathbb {R}}^{m\times n}\) and vector \(\mathbf {b}\in {\mathbb {R}}^m\); we use the convention that vectors are bold type, and matrices and scalars are not. We are interested in the highly overdetermined setting, where \(m \gg n\), in which case the system need not have a solution. Iterative solvers like the Kaczmarz method [7, 17], Motzkin’s method [1, 4, 10], and the Gauss-Seidel method [8, 9] have recently regained popularity for such problems since they are particularly efficient in terms of computation and storage.

The Kaczmarz method is a popular iterative solver for overdetermined systems of linear equations and is especially preferred for large-scale systems since it need not ever load the entire system into memory at once. The method consists of sequential orthogonal projections toward the solution set of a single equation (or subsystem). Given the system \(A\mathbf {x} = \mathbf {b}\), the method computes iterates by projecting onto the hyperplane defined by the equation \(\mathbf {a}_i^T \mathbf {x} = b_i\) where \(\mathbf {a}_i^T\) is a selected row of the matrix A and \(b_i\) is the corresponding entry of \(\mathbf {b}\). The iterates are recursively defined as

$$\begin{aligned} \mathbf {x}_{k+1} = \mathbf {x}_k + \frac{b_i - \mathbf {a}_i^T\mathbf {x}_k}{\Vert \mathbf {a}_i\Vert ^2} \mathbf {a}_i \end{aligned}$$
(1.1)

where \(\mathbf {a}_i^T\) is selected from among the rows of A (and the initialization \(\mathbf {x}_0\) is chosen arbitrarily). The seminal work of Strohmer and Vershynin [17] proved exponential convergence for the randomized Kaczmarz method, where the ith row \(\mathbf {a}_i^T\) is chosen with probability \(\Vert \mathbf {a}_i\Vert ^2/\Vert A\Vert _F^2\). Since then many variants of the method have been proposed and analyzed for various types of systems; see, e.g., [3, 5, 6, 13, 14, 15, 19] and references therein.
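For concreteness, a minimal sketch of the randomized Kaczmarz update (1.1) with the row-selection probabilities of [17] is given below; the function name and interface are illustrative only, not taken from the references.

```python
import numpy as np

def randomized_kaczmarz(A, b, num_iters=1000, x0=None, rng=None):
    """Sketch of randomized Kaczmarz: sample row i with probability ||a_i||^2 / ||A||_F^2."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    row_norms_sq = np.einsum('ij,ij->i', A, A)      # ||a_i||^2 for each row
    probs = row_norms_sq / row_norms_sq.sum()       # ||a_i||^2 / ||A||_F^2
    for _ in range(num_iters):
        i = rng.choice(m, p=probs)
        a_i = A[i]
        x += (b[i] - a_i @ x) / row_norms_sq[i] * a_i   # projection step (1.1)
    return x
```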

It is known that the randomized Kaczmarz method converges for inconsistent systems (or equivalently, those corrupted by noise) only up to an error threshold that depends on A and the noise. In [11] it was shown that the iterates of this method satisfy:

$$\begin{aligned} {\mathbb {E}}\Vert \mathbf {x}_k - \mathbf {x}_{\text {LS}}\Vert ^2 \le \left( 1 - \frac{\sigma _{\text {min}}^2(A)}{\Vert A\Vert _F^2}\right) ^k \Vert \mathbf {x}_0 - \mathbf {x}_{\text {LS}}\Vert ^2 + \frac{\Vert A\Vert _F^2}{\sigma _{\text {min}}^2(A)}\Vert \mathbf {e}\Vert _{\infty }^2, \end{aligned}$$
(1.2)

where here and throughout, the norm without subscript, \(\Vert \cdot \Vert \), denotes the Euclidean norm, \(\sigma _{\text {min}}(A)\) denotes the minimum singular value of A, \(\Vert A\Vert _F\) its Frobenius norm, \( \mathbf {x}_{\text {LS}}\) the least squares solution, and \(\mathbf {e} = \mathbf {b} - A \mathbf {x}_{\text {LS}}\) the error term. There are variants of this method that converge to the least squares solution, e.g., [19], which utilizes an additional projection step to project off the error term \(\mathbf {e}\). Additionally, it is known that if a linear system of equations or inequalities is feasible, then randomized Kaczmarz provides a proof or certificate of feasibility, and there are probabilistic guarantees on how quickly it does so [4].

A related but seemingly independently studied approach is that of Agmon [1] and Motzkin and Schoenberg [10], re-imagined a few years later as the now famous perceptron algorithm [16]. These approaches are most often used for feasibility problems, where one seeks a point that resides within some polyhedron described by a system of inequalities; of course, linear systems of equations are one special instance. Additionally, this so-called Motzkin method has been referred to as the Kaczmarz method with the “most violated constraint” or “maximal-residual” control [2, 13, 14]. As these descriptors suggest, this method iterates in a similar fashion to the Kaczmarz method, but rather than selecting a row of A in sequential or randomized order, it selects the row corresponding to the most violated constraint, as described in Algorithm 1. Starting from any initial point \(\mathbf {x}_0\), the method proceeds as follows. If the current point \(\mathbf {x}_k\) is a solution, the method terminates; otherwise there must be a constraint \(\mathbf {a}_i^T \mathbf {x} = b_i\) that is most violated. This constraint defines a hyperplane H. The method then projects \(\mathbf {x}_k\) onto this hyperplane as in (1.1), or perhaps under- or over-projects using an alternate step-size; see [4, 10] for details. Selecting the most violated constraint is intuitive for feasibility problems or for solving consistent linear systems of equations. In the inconsistent case, it may not always make sense to project onto the most violated constraint; see Fig. 1 for a simple example of this situation. However, following [4], we present experimental evidence that suggests Motzkin’s method often offers an initially accelerated convergence rate, both for consistent and inconsistent systems of equations.

Fig. 1

An example of a series of projections using the Motzkin approach on an inconsistent system. Lines represent the hyperplanes consisting of sets \(\{\mathbf {x}: \mathbf {a}_i^T \mathbf {x} = b_i\}\) for rows \(\mathbf {a}_i^T\) of A, and \(\mathbf {x}^*\) denotes the desired solution

Algorithm 1 (Motzkin’s method)
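Since Algorithm 1 appears above only as a figure, we include a minimal sketch of the greedy iteration it describes, assuming unit-norm rows and omitting the under/over-projection step-size of [4, 10]; the naming and the stopping tolerance are ours.

```python
import numpy as np

def motzkin(A, b, num_iters=1000, tol=0.0, x0=None):
    """Sketch of Motzkin's method: project onto the most violated constraint at each iteration."""
    x = np.zeros(A.shape[1]) if x0 is None else np.array(x0, dtype=float)
    for _ in range(num_iters):
        residual = A @ x - b
        i = np.argmax(np.abs(residual))        # index of the most violated constraint
        if np.abs(residual[i]) <= tol:         # terminate if (approximately) solved
            break
        x -= residual[i] * A[i]                # projection (1.1) with ||a_i|| = 1
    return x
```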

1.1 Contribution

We show that Motzkin’s method for systems of linear equations features an initially accelerated convergence rate when the residual has a large dynamic range. We provide bounds for the iterate error which depend on the dynamic range of the residual; these bounds can potentially be used when designing stopping criteria or hybrid approaches. Next, as a concrete example, we show that Gaussian systems of linear equations have large dynamic range and provide bounds on this value. We extend this to a corollary which shows that the initial convergence rate is highly accelerated and that our theoretical bound closely matches experimental evidence.

2 Accelerated convergence of Motzkin’s method

The advantage of the Motzkin method is that by greedily selecting the most violated constraint, the method makes large moves at each iteration, thereby accelerating convergence. One drawback, of course, is that it is computationally expensive to determine which constraint is most violated. For this reason, De Loera et al. [4] proposed a hybrid batched variant of the method that randomly selects a batch of rows and then computes the most violated constraint from that batch. This method is quite fast when using parallel computation, and even without parallelization techniques the accelerated convergence it offers often outweighs the increased computational cost. When the system is inconsistent, however, there is an additional drawback to the Motzkin method, because projecting onto the most violated constraint need not move the iterate closer to the desired solution, as already mentioned and shown in Fig. 1. Our first lemma provides a rule for deciding if a greedy projection offers desirable improvement. Here and throughout, we assume that the matrix A has been normalized to have unit row norms, \(\Vert \mathbf {a}_i\Vert ^2 = 1\), and that the matrix has full column rank, n.

Lemma 2.1

Let \(\mathbf {x}\) denote any desired solution of the system given by matrix A and right hand side \(\mathbf {b}\). If \(\mathbf {e} = A\mathbf {x}- \mathbf {b}\) and \(\Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty > 4\Vert \mathbf {e}\Vert _\infty \) then the next iterate, \(\mathbf {x}_{k+1}\) defined by Algorithm 1 satisfies

$$\begin{aligned} \Vert \mathbf {x}_{k+1} - \mathbf {x}\Vert ^2 \le \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \frac{1}{2} \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty ^2. \end{aligned}$$

Proof

By definition of \(\mathbf {x}_{k+1}\), we have

$$\begin{aligned} \Vert \mathbf {x}_{k+1} - \mathbf {x}\Vert ^2&= \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - 2(\mathbf {a}_{i_{k+1}}^T\mathbf {x}_k-b_{i_{k+1}})(\mathbf {a}_{i_{k+1}}^T\mathbf {x}_k - b_{i_{k+1}} - e_{i_{k+1}})\nonumber \\&\quad + (\mathbf {a}_{i_{k+1}}^T\mathbf {x}_k - b_{i_{k+1}})^2 \nonumber \\&= \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - (\mathbf {a}_{i_{k+1}}^T\mathbf {x}_k - b_{i_{k+1}})^2 + 2(\mathbf {a}_{i_{k+1}}^T\mathbf {x}_k-b_{i_{k+1}})e_{i_{k+1}}\nonumber \\&\le \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - (\mathbf {a}_{i_{k+1}}^T\mathbf {x}_k - b_{i_{k+1}})^2 + 2|\mathbf {a}_{i_{k+1}}^T\mathbf {x}_k-b_{i_{k+1}}|\cdot |e_{i_{k+1}}|\nonumber \\&= \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty ^2 + 2 \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty |e_{i_{k+1}}|\nonumber \\&\le \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty ^2 + 2 \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty \Vert \mathbf {e}\Vert _\infty \nonumber \\&\le \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \frac{1}{2} \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty ^2. \end{aligned}$$
(2.1)

\(\square \)

Note that this tells us that while our residual is still large relative to the error, Motzkin’s method can offer good progress in each iteration. Also, this progress is better than the expected progress offered by Randomized Kaczmarz (RK) when the residual has good dynamic range, in particular when:

$$\begin{aligned} \frac{1}{2} \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty ^2 > \frac{1}{m} \Vert A\mathbf {x}_k - \mathbf {b}\Vert ^2. \end{aligned}$$

We can use Lemma 2.1 to easily obtain the following corollary.

Corollary 2.1

Let \(\mathbf {x}\) denote any desired solution of the system given by matrix A and right hand side \(\mathbf {b}\) and write \(\mathbf {e} = A\mathbf {x}-\mathbf {b}\) as the error term. Then for any given iteration k, the iterate defined by Algorithm 1 satisfies either (i) or both (ii) and (iii), where

$$\begin{aligned}&\text {(i)}\quad \Vert \mathbf {x}_{k+1} - \mathbf {x}\Vert ^2 \le \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \frac{1}{2} \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty ^2\\&\text {(ii)}\quad \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 \le 25m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2\\&\text {(iii)}\quad \Vert \mathbf {x}_{k+1} - \mathbf {x}\Vert ^2 \le \left( 25m\sigma _{\min }^{-2}(A) + 8\right) \Vert \mathbf {e}\Vert _{\infty }^2. \end{aligned}$$

In addition, if the method is run for K iterations with the stopping criterion \(\Vert A\mathbf {x}_K - \mathbf {b}\Vert _\infty \le 4\Vert \mathbf {e}\Vert _\infty \), then the method exhibits the (possibly highly accelerated) convergence rate

$$\begin{aligned} \Vert \mathbf {x}_{K} - \mathbf {x}\Vert ^2&\le \prod _{k=0}^{K-1}\left( 1 - \frac{\sigma _{\min }^2(A)}{4\gamma _k}\right) \cdot \Vert \mathbf {x}_0 - \mathbf {x}\Vert ^2 + 2m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2, \end{aligned}$$
(2.2)
$$\begin{aligned}&\le \left( 1 - \frac{\sigma _{\min }^2(A)}{4m}\right) ^K\Vert \mathbf {x}_0 - \mathbf {x}\Vert ^2 + 2m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2, \end{aligned}$$
(2.3)

with final error satisfying (ii). Here \(\gamma _k\) bounds the dynamic range of the kth residual, \(\gamma _k := \frac{\Vert A\mathbf {x}_k-A\mathbf {x}\Vert ^2}{\Vert A\mathbf {x}_k-A\mathbf {x}\Vert _{\infty }^2}\).

Proof

We consider two cases, depending on whether \(\Vert A\mathbf {x}_k -\mathbf {b}\Vert _\infty > 4\Vert \mathbf {e}\Vert _\infty \) or \(\Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty \le 4\Vert \mathbf {e}\Vert _\infty \). If the former holds, then (i) is valid by Lemma 2.1. If instead the latter holds, then we first obtain (ii) by the simple argument

$$\begin{aligned} \Vert \mathbf {x}_{k} - \mathbf {x}\Vert ^2&\le \sigma _{\min }^{-2}(A)\Vert A\mathbf {x}_k - A\mathbf {x}\Vert ^2\\&\le \sigma _{\min }^{-2}(A)m\Vert A\mathbf {x}_{k} - A\mathbf {x}\Vert _{\infty }^2\\&\le \sigma _{\min }^{-2}(A)m\left( \Vert A\mathbf {x}_{k} - \mathbf {b}\Vert _{\infty }^2 + 2\Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty \Vert \mathbf {e}\Vert _\infty + \Vert \mathbf {e}\Vert _{\infty }^2\right) \\&\le \sigma _{\min }^{-2}(A)m\left( 16\Vert \mathbf {e}\Vert _{\infty }^2 + 8 \Vert \mathbf {e}\Vert _\infty ^2 + \Vert \mathbf {e}\Vert _{\infty }^2\right) \\&= 25m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2. \end{aligned}$$

To obtain (iii) still in this latter case, we continue from (2.1) showing

$$\begin{aligned} \Vert \mathbf {x}_{k+1} - \mathbf {x}\Vert ^2&\le \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty ^2 + 2 \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty \Vert \mathbf {e}\Vert _\infty \\&\le 25m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2 - \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty ^2 + 2 \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty \Vert \mathbf {e}\Vert _\infty \\&\le 25m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2 + 2 \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty \Vert \mathbf {e}\Vert _\infty \\&\le 25m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2 + 8 \Vert \mathbf {e}\Vert _\infty ^2\\&= \left( 25m\sigma _{\min }^{-2}(A)+ 8 \right) \Vert \mathbf {e}\Vert _\infty ^2. \end{aligned}$$

To prove (2.2) and (2.3), we first note that by choice of stopping criterion, the hypothesis of Lemma 2.1 holds for all \(0\le k \le K-1\), and hence so does (i). Thus for all \(1\le k \le K\), we have

$$\begin{aligned} \Vert \mathbf {x}_{k} - \mathbf {x}\Vert ^2&\le \Vert \mathbf {x}_{k-1} - \mathbf {x}\Vert ^2 - \frac{1}{2}\Vert A\mathbf {x}_{k-1} - \mathbf {b}\Vert _\infty ^2\nonumber \\&= \Vert \mathbf {x}_{k-1} - \mathbf {x}\Vert ^2 - \frac{1}{2}\Vert (A\mathbf {x}_{k-1} - A\mathbf {x}) - \mathbf {e}\Vert _\infty ^2\nonumber \\&\le \Vert \mathbf {x}_{k-1} - \mathbf {x}\Vert ^2 - \frac{1}{4}\Vert A\mathbf {x}_{k-1} - A\mathbf {x}\Vert _\infty ^2 + \frac{1}{2}\Vert \mathbf {e}\Vert _{\infty }^2 \end{aligned}$$
(2.4)
$$\begin{aligned}&= \Vert \mathbf {x}_{k-1} - \mathbf {x}\Vert ^2 - \frac{1}{4\gamma _{k-1}}\Vert A\mathbf {x}_{k-1} - A\mathbf {x}\Vert ^2 + \frac{1}{2}\Vert \mathbf {e}\Vert _{\infty }^2 \nonumber \\&\le \Vert \mathbf {x}_{k-1} - \mathbf {x}\Vert ^2 - \frac{\sigma _{\min }^{2}(A)}{4\gamma _{k-1}}\Vert \mathbf {x}_{k-1} - \mathbf {x}\Vert ^2 + \frac{1}{2}\Vert \mathbf {e}\Vert _{\infty }^2\nonumber \\&= \left( 1 - \frac{\sigma _{\min }^{2}(A)}{4\gamma _{k-1}}\right) \Vert \mathbf {x}_{k-1} - \mathbf {x}\Vert ^2 + \frac{1}{2}\Vert \mathbf {e}\Vert _{\infty }^2, \end{aligned}$$
(2.5)

where the first line follows from (i), the third from Jensen’s inequality, and the fifth from properties of singular values.

Iterating the relation given by (2.5) recursively yields

$$\begin{aligned} \Vert \mathbf {x}_{K} - \mathbf {x}\Vert ^2&\le \prod _{k=0}^{K-1}\left( 1 - \frac{\sigma _{\min }^{2}(A)}{4\gamma _k}\right) \cdot \Vert \mathbf {x}_{0} - \mathbf {x}\Vert ^2 + \sum _{j=0}^{K-1} \prod _{k=0}^{j-1}\left( 1 - \frac{\sigma _{\min }^{2}(A)}{4\gamma _k}\right) \frac{1}{2}\Vert \mathbf {e}\Vert _{\infty }^2\\&\le \prod _{k=0}^{K-1}\left( 1 - \frac{\sigma _{\min }^{2}(A)}{4\gamma _k}\right) \cdot \Vert \mathbf {x}_{0} - \mathbf {x}\Vert ^2 + \sum _{j=0}^{K-1}\left( 1 - \frac{\sigma _{\min }^{2}(A)}{4m}\right) ^j\frac{1}{2}\Vert \mathbf {e}\Vert _{\infty }^2\\&\le \prod _{k=0}^{K-1}\left( 1 - \frac{\sigma _{\min }^{2}(A)}{4\gamma _k}\right) \cdot \Vert \mathbf {x}_{0} - \mathbf {x}\Vert ^2 + 2m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2\\&\le \left( 1 - \frac{\sigma _{\min }^{2}(A)}{4m}\right) ^K\Vert \mathbf {x}_{0} - \mathbf {x}\Vert ^2 + 2m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2, \end{aligned}$$

where the second and fourth inequalities follow from the simple bound \(\gamma _k \le {m}\) and the third by bounding above by the infinite sum. The last two inequalities complete the proof of (2.2) and (2.3). \(\square \)

Note that Lemma 2.1 and Corollary 2.1 are true for any desired solution, \(\mathbf {x}\). Here the desired solution could be the least squares solution or generally any other point. However, the residual of the desired solution, \(A\mathbf {x} - \mathbf {b}\), determines the error \(\mathbf {e}\) and the final error of Motzkin’s method.

Fig. 2

Convergence of Motzkin’s method and RK on correlated system with corresponding theoretical bounds

We note that the convergence rate given by (2.2) yields a significant improvement over that given by (1.2) when the dynamic range of many residuals is large, i.e., when \(\gamma _k \ll m\) for many iterations k. In Fig. 2, we present the convergence of Motzkin and RK, together with the corresponding theoretical bounds, on a random system which before normalization is defined by a matrix \(A \in {\mathbb {R}}^{5000 \times 100}\) with \(a_{ij} \sim {\mathcal {N}}(1,0.5)\) and \(\mathbf {b} = A{\mathbf {1}} + {\varvec{\varepsilon }}\), where \({\mathbf {1}}\) denotes the all ones vector and \({\varvec{\varepsilon }}\) is a Gaussian vector. Figure 3 presents plots of the convergence of Motzkin and RK, and the corresponding theoretical bounds, on systems of equations defined by problems from the Netlib linear programming benchmark set [12]. These problems contain naturally underdetermined systems, which we transform into overdetermined, inconsistent systems with nearly the same least-squares solution. We transform each problem, originally given by the underdetermined system of equations \(A\mathbf {x} = \mathbf {b}\), by adding equations to form

$$\begin{aligned} \begin{bmatrix} A \\ I \end{bmatrix} \mathbf {x} = \begin{bmatrix} \mathbf {b} \\ \mathbf {x}_{LS} + {\varvec{\varepsilon }} \end{bmatrix} \end{aligned}$$

where \(\mathbf {x}_{LS}\) is the least-norm solution of \(A\mathbf {x} = \mathbf {b}\) and \({\varvec{\varepsilon }}\) is a Gaussian vector with small variance, and then normalizing the resulting system. Each problem has a very small error which is distributed relatively uniformly, so there are many iterations in which the theoretical bounds hold. The resulting matrices are of size \(1103 \times 615\) for problem agg, \(1274 \times 758\) for problems agg2 and agg3, and \(777 \times 472\) for problem bandm. These plots show only the iterations before the stopping criterion is met.
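A minimal sketch of this construction is given below; the helper name and the noise level are illustrative, and we assume the rows of the augmented matrix are nonzero so that the normalization is well defined.

```python
import numpy as np

def augment_and_normalize(A, b, noise_std=1e-3, rng=None):
    """Sketch: turn an underdetermined system A x = b into an overdetermined, inconsistent one
    whose least-squares solution is near the least-norm solution of the original system."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    x_ln = np.linalg.lstsq(A, b, rcond=None)[0]          # least-norm solution of A x = b
    A_aug = np.vstack([A, np.eye(n)])                     # stack [A; I]
    b_aug = np.concatenate([b, x_ln + noise_std * rng.standard_normal(n)])
    row_norms = np.linalg.norm(A_aug, axis=1)
    return A_aug / row_norms[:, None], b_aug / row_norms  # normalize each equation
```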

Fig. 3

Convergence of Motzkin’s method and RK, and corresponding theoretical bounds for Netlib linear programming problems. Upper left: agg; upper right: agg2; lower left: agg3; lower right: bandm

In Table 1, we include the CPU computation time (computed with the Matlab function cputime) required to compute an iterate \(\mathbf {x}_k\) with \(\Vert A\mathbf {x}_k-\mathbf {b}\Vert _\infty \le 4 \Vert A\mathbf {x}_{\text {LS}}-\mathbf {b}\Vert _\infty \). We include computation times averaged over 10 trials for Motzkin’s method and the randomized Kaczmarz method on the Netlib problems agg, agg2, agg3, and bandm. Note that this computation is performed with no parallelization implemented for Motzkin’s method, which means that each iteration of Motzkin’s method is much more costly than that of RK. Nevertheless, Motzkin’s method outperforms RK on some of the selected Netlib problems. However, this is not the focus of this paper, as the acceleration described in Lemma 2.1 does not necessarily guarantee Motzkin’s method a computational advantage if its iterations are significantly more costly than those of RK.
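A rough sketch of how such a timing comparison can be set up (here in Python, with process_time standing in for Matlab's cputime, and with the per-iteration update passed in as a callable) is the following; the helper names and interface are ours.

```python
import time
import numpy as np

def motzkin_step(A, b, x):
    """One greedy step of Motzkin's method (unit-norm rows assumed)."""
    r = A @ x - b
    i = np.argmax(np.abs(r))
    return x - r[i] * A[i]

def cpu_time_to_tol(step, A, b, tol, max_iters=10**6):
    """CPU seconds until ||A x_k - b||_inf <= tol, starting from x_0 = 0 (cf. Table 1)."""
    x = np.zeros(A.shape[1])
    start = time.process_time()
    for _ in range(max_iters):
        if np.max(np.abs(A @ x - b)) <= tol:
            break
        x = step(A, b, x)
    return time.process_time() - start
```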

This acceleration is in force until the stopping criterion given in the corollary is met. This bound, therefore, can be used to design such stopping criteria; one could, for example, design an approach that utilizes the Motzkin method until reaching this threshold and then switches to the traditional RK selection strategy to reduce the convergence horizon. In Fig. 4, we see that Motzkin outperforms RK in the initial iterations (while \(\Vert A \mathbf {x}_k - \mathbf {b}\Vert _\infty \gg \Vert \mathbf {e}\Vert _\infty \)) on a system with Gaussian noise. Here, before normalization, the system consists of a Gaussian matrix \(A \in {\mathbb {R}}^{50{,}000 \times 100}\) and right-hand side \(\mathbf {b} = A{\mathbf {1}} + \mathbf {e}\), where \({\mathbf {1}}\) is the vector of all ones and \(\mathbf {e}\) is a Gaussian vector. However, for a system with sparse, large magnitude error, Motzkin does not perform as well in the long run, as it suffers from a worse convergence horizon than RK. Here, before normalization, the system consists of a Gaussian matrix \(A \in {\mathbb {R}}^{50{,}000 \times 100}\) and right-hand side \(\mathbf {b} = A{\mathbf {1}} + 15\sum _{j \in S} \mathbf {e}_j\), where \(\mathbf {e}_j\) denotes the jth coordinate vector and S is a uniform random sample of 50 indices.

Table 1 Average CPU computation times (s) required to compute iterate \(\mathbf {x}_k\) with \(\Vert A\mathbf {x}_k-\mathbf {b}\Vert _\infty \le 4\Vert A\mathbf {x}_{\text {LS}}-\mathbf {b}\Vert _\infty \) for the four Netlib problems, agg, agg2, agg3, and bandm. These values are averaged over 10 trials
Fig. 4

Left: Motzkin’s method versus RK distance from least-squares solution for a Gaussian system with Gaussian noise. Right: Motzkin’s method versus RK distance from least-squares solution for a Gaussian system with sparse, ‘spiky’ noise

To capitalize on this accelerated convergence, one needs knowledge of an upper bound \(\Vert \mathbf {e}\Vert _{\infty } \le \beta \), in which case the stopping criterion of \(\Vert A\mathbf {x}_k - \mathbf {b}\Vert _{\infty } \le 4\beta \) guarantees the accelerated convergence of (2.2) and a final error of \(\Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 \le 25m\sigma _{\min }^{-2}(A)\beta ^2\). Indeed, one quickly verifies that when \(\Vert A\mathbf {x}_k - \mathbf {b}\Vert _{\infty } \le 4\beta \), we have

$$\begin{aligned} \Vert \mathbf {x}_k - \mathbf {x}\Vert&\le \sigma _{\min }^{-1}(A)\Vert A\mathbf {x}_k - A\mathbf {x}\Vert \\&\le \sqrt{m}\sigma _{\min }^{-1}(A)\Vert A\mathbf {x}_k - A\mathbf {x}\Vert _{\infty }\\&\le \sqrt{m}\sigma _{\min }^{-1}(A)\left( \Vert A\mathbf {x}_k - \mathbf {b}\Vert _{\infty } + \Vert \mathbf {e}\Vert _{\infty }\right) \\&\le \sqrt{m}\sigma _{\min }^{-1}(A)\left( 4\beta + \beta \right) \!. \end{aligned}$$
Fig. 5

Left: Average \(\gamma _k\) values for various choices of row dimension, m, of normalized Gaussian \(A \in {{\mathrm{{\mathbb {R}}}}}^{m \times 100}\). Right: An example of the values \(\gamma _k\) for a single run of Motzkin’s method and the corresponding ratio for RK, the matrix is a \(50{,}000 \times 100\) Gaussian. The index \(i_k\) denotes the index chosen in the kth iteration of each method. The horizontal lines denote the values m and \({m/\log (m)}\). We see acceleration when \(\gamma _k < {m}\)

Since the acceleration of the method occurs when many of the terms \(\gamma _k\) are small, we plot an example in Fig. 5. As expected, many terms are bounded away from m. We analyze this further for the Gaussian case below.
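The quantities plotted in Fig. 5 can be tracked, for a synthetic system where the desired solution \(\mathbf {x}\) is known, with instrumentation along the following lines; the function name and loop structure are ours.

```python
import numpy as np

def motzkin_with_gammas(A, b, x_true, num_iters=500):
    """Run Motzkin's method and record gamma_k = ||A(x_k - x)||^2 / ||A(x_k - x)||_inf^2."""
    x = np.zeros(A.shape[1])
    gammas = []
    for _ in range(num_iters):
        d = A @ (x - x_true)
        gammas.append(np.sum(d ** 2) / np.max(np.abs(d)) ** 2)  # dynamic-range term of Corollary 2.1
        r = A @ x - b
        i = np.argmax(np.abs(r))
        x -= r[i] * A[i]                                        # greedy projection, unit-norm rows
    return x, np.array(gammas)                                  # acceleration is present when gammas << m
```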

We also only expect this acceleration to be present while the condition of Lemma 2.1 is in force (i.e., prior to the stopping condition given in the corollary). Once the condition of Lemma 2.1 is no longer satisfied, greedy selection will favor those entries of the residual which have a large contribution from the error, moving the iterate far from the desired solution. While the difference between greedy selection and randomized selection is not so drastic for Gaussian noise, it can be dramatic for a sparse error. We include an example system in Fig. 6 to assist with intuition. Again, one could of course switch to the Kaczmarz approach after an initial use of the Motzkin method as a strategy to gain acceleration without sacrificing convergence horizon. In Fig. 7, we present the convergence of Motzkin’s method, the RK method, and a hybrid method which consists of Motzkin iterations until \(\Vert A\mathbf {x}_k - \mathbf {b}\Vert _{\infty } \le 4\Vert \mathbf {e}\Vert _\infty \), followed by RK iterations; a sketch of this strategy is given below. Again we include results on both a system with Gaussian error and a system with a sparse, ‘spiky’ error, with the systems generated as in Fig. 4.
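A minimal sketch of this hybrid strategy (greedy steps until \(\Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty \le 4\Vert \mathbf {e}\Vert _\infty \), then RK steps) follows; it assumes unit-norm rows, so the RK sampling is uniform, and that \(\Vert \mathbf {e}\Vert _\infty \) (or an upper bound \(\beta \)) is available.

```python
import numpy as np

def hybrid_motzkin_rk(A, b, e_inf, num_iters=2000, rng=None):
    """Sketch: Motzkin (greedy) steps while ||A x - b||_inf > 4 * e_inf, then randomized Kaczmarz."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(num_iters):
        r = A @ x - b
        if np.max(np.abs(r)) > 4 * e_inf:
            i = np.argmax(np.abs(r))       # greedy phase: most violated constraint
        else:
            i = rng.integers(m)            # RK phase: uniform sampling since rows are normalized
        x -= r[i] * A[i]
    return x
```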

Fig. 6

An example of three iterations of Motzkin’s method (\(x_k^M\)) and three iterations of RK (\(x_k^{RK}\)) on a Gaussian system with sparse, ‘spiky’ error. More of the RK iterations are near the least squares solution while Motzkin consistently selects the corrupted equation

Fig. 7

Left: Motzkin’s method, RK, and hybrid distance from least-squares solution for a Gaussian system with Gaussian noise. Right: Motzkin’s method, RK, and hybrid distance from least-squares solution for a Gaussian system with sparse, ‘spiky’ noise

2.1 Heuristics for the Gaussian case

Here, we study heuristics for our convergence results in the Gaussian matrix case. Note that our results hold for matrices with normalized rows. For simplicity, however, we will consider an \(m\times n\) matrix whose entries are i.i.d. Gaussian with mean 0 and variance 1/n. We will then assume we are in the asymptotic regime where this distribution approximates a Gaussian matrix with normalized unit-norm rows. To that end, we assume m and n both grow linearly with respect to one another, and that they are both substantially large.

Define \(I_k\) to be the set of rows of A that are independent of \(\mathbf {x}_k\) and note that \(I_k \subseteq I_{k-1} \subseteq \cdots \subseteq I_1 \subseteq I_0 = [m]\). Fix iteration k and define \(m' = |I_k|\). Note that \(m-k \le m' \le m\) is the number of rows of the sub-matrix whose rows are independent of the iterates up to iteration k. Throughout this section \({\mathbb {P}}\) and \({\mathbb {E}}\) refer to probability and expectation taken with respect to the random and unsampled portion of the matrix A, \(A_{I_k}\), which has \(m'\) rows.

Our first lemma gives a bound on the expected dynamic range for a Gaussian matrix.

Lemma 2.2

If \(A \in {{\mathrm{{\mathbb {R}}}}}^{m \times n}\) is a Gaussian matrix with \(a_{ij} \sim {\mathcal {N}}(0,1/n)\) and \(\mathbf {x}\) is independent of at least \(m'\) rows of A (e.g. constructed via k iterations of Motzkin’s method) then

$$\begin{aligned} \frac{{\mathbb {E}}\Vert A\mathbf {x}\Vert ^2}{{\mathbb {E}}\Vert A\mathbf {x}\Vert _\infty ^2} \lesssim \frac{n(m' + \sum _{i \not \in I_k} \Vert \mathbf {a}_i\Vert ^2)}{\log (m')}. \end{aligned}$$

Proof

First note that

$$\begin{aligned} {\mathbb {E}}(\sum _{i=1}^m (\mathbf {a}_i^T\mathbf {x})^2)&= \sum _{i=1}^m {\mathbb {E}} (\mathbf {a}_i^T\mathbf {x})^2\\&\le \sum _{i=1}^m {\mathbb {E}}(\Vert \mathbf {a}_i\Vert ^2 \Vert \mathbf {x}\Vert ^2)&\text {by Cauchy-Schwarz}\\&\le \sum _{i \in I_k} {\mathbb {E}}(\Vert \mathbf {a}_i\Vert ^2 \Vert \mathbf {x}\Vert ^2) + \sum _{i \not \in I_k} \Vert \mathbf {a}_i\Vert ^2 \Vert \mathbf {x}\Vert ^2\\&= (m' + \sum _{i \not \in I_k} \Vert \mathbf {a}_i\Vert ^2) \Vert \mathbf {x}\Vert ^2. \end{aligned}$$

Next, note that if \(\mathbf {a}_i\) and \(\mathbf {x}\) are independent then \(\mathbf {a}_i^T\mathbf {x} \sim {\mathcal {N}}(0,\Vert \mathbf {x}\Vert ^2/n)\). Then

$$\begin{aligned} {\mathbb {E}}(\max _{i \in [m]} (\mathbf {a}_i^T\mathbf {x})^2)&\ge {\mathbb {E}}(\max _{i \in I_k} (\mathbf {a}_i^T\mathbf {x})^2)\\&\ge {\mathbb {E}}(\max _{i \in I_k} \mathbf {a}_i^T\mathbf {x})^2\\&\ge ({\mathbb {E}}\max _{i \in I_k} \mathbf {a}_i^T\mathbf {x})^2&\text {by Jensen's inequality}\\&\ge \frac{c\Vert \mathbf {x}\Vert ^2\log (m')}{n}, \end{aligned}$$

as it is commonly known that \({\mathbb {E}}(\max _{i \in [N]} X_i) \ge c \sigma \sqrt{\log N}\) for independent \(X_i \sim {\mathcal {N}}(0,\sigma ^2)\). Thus, we have

$$\begin{aligned} {\mathbb {E}}\Vert A\mathbf {x}\Vert _\infty ^2 \ge \frac{c \Vert \mathbf {x}\Vert ^2 \log (m')}{n} \ge c \frac{\log (m')}{n(m' + \sum _{i \not \in I_k}\Vert \mathbf {a}_i\Vert ^2)}{\mathbb {E}}\Vert A\mathbf {x}\Vert ^2. \end{aligned}$$

\(\square \)
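A quick numerical sanity check of Lemma 2.2 in the simplest case, where \(\mathbf {x}\) is independent of every row (so \(m' = m\) and the sum over \(i \not \in I_k\) is empty), might look as follows; the absolute constant hidden by \(\lesssim \) is not specified by the lemma, so only the order of magnitude of the comparison is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, trials = 50_000, 100, 10

num, den = 0.0, 0.0
for _ in range(trials):
    A = rng.standard_normal((m, n)) / np.sqrt(n)   # a_ij ~ N(0, 1/n)
    x = rng.standard_normal(n)                     # independent of every row of A
    Ax = A @ x
    num += np.sum(Ax ** 2)                         # accumulates an estimate of E||Ax||^2
    den += np.max(np.abs(Ax)) ** 2                 # accumulates an estimate of E||Ax||_inf^2

print(num / den, n * m / np.log(m))                # observed ratio vs. the Lemma 2.2 bound with m' = m
```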

We can use this lemma along with our main result to obtain the following.

Corollary 2.2

Let \(A \in {{\mathrm{{\mathbb {R}}}}}^{m \times n}\) be a normalized Gaussian matrix as described previously, let \(\mathbf {x}\) denote the desired solution of the system given by matrix A and right hand side \(\mathbf {b}\), write \(\mathbf {e} = A\mathbf {x}-\mathbf {b}\) as the error term, and assume \(\mathbf {x}_0\) is chosen so that \(\mathbf {x}_0 - \mathbf {x}\) is independent of the rows of A, \(\mathbf {a}_i^T\). If Algorithm 1 is run with stopping criterion \(\Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty \le 4\Vert \mathbf {e}\Vert _\infty \), then in expectation the method exhibits the accelerated convergence rate

$$\begin{aligned} {\mathbb {E}} \Vert \mathbf {x}_{k+1}-\mathbf {x}\Vert ^2 \lesssim {\mathbb {E}}\Bigg [\Bigg (1 - \frac{\log (m') \sigma _{\min }^2(A)}{4nm}\Bigg ) \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 + \frac{1}{2}\Vert \mathbf {e}\Vert _\infty ^2\Bigg ]. \end{aligned}$$
(2.6)
Fig. 8

Convergence of Motzkin’s method and RK on Gaussian system with corresponding theoretical rate for RK and conjectured rate for Motzkin’s method

Proof

Beginning from line (2.4) of the proof of Corollary 2.1 and taking expectation of both sides, we have

$$\begin{aligned} {\mathbb {E}}\Vert \mathbf {x}_{k+1}-\mathbf {x}\Vert ^2&\le {\mathbb {E}} \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \frac{1}{4}{\mathbb {E}} \Vert A(\mathbf {x}_k-\mathbf {x})\Vert _\infty ^2 + \frac{1}{2}{\mathbb {E}}\Vert \mathbf {e}\Vert _\infty ^2\\&\lesssim {\mathbb {E}} \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \frac{\log (m')}{4n(m'+\sum _{i \not \in I_k}\Vert \mathbf {a}_i\Vert ^2)}{\mathbb {E}} \Vert A(\mathbf {x}_k-\mathbf {x})\Vert ^2 + \frac{1}{2}{\mathbb {E}}\Vert \mathbf {e}\Vert _\infty ^2\\&= {\mathbb {E}} \Bigg [\Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \frac{ \log (m')}{4nm} \Vert A\mathbf {x}_k-A\mathbf {x}\Vert ^2 + \frac{1}{2}\Vert \mathbf {e}\Vert _\infty ^2\Bigg ]\\&\le {\mathbb {E}} \Bigg [\Bigg (1 - \frac{\log (m')\sigma _{\min }^2(A)}{4nm}\Bigg ) \Vert \mathbf {x}_k -\mathbf {x}\Vert ^2 + \frac{1}{2}\Vert \mathbf {e}\Vert _\infty ^2\Bigg ] \end{aligned}$$

where the second inequality follows from Lemma 2.2 and the fourth from properties of singular values. \(\square \)

This corollary implies a logarithmic improvement in the convergence rate when \(n \ll \log (m')\), at least initially. Of course, we conjecture that the \(\log (m')\) term in (2.6) is an artifact of the proof and could actually be replaced with \(\log (m)\). Additionally, we conjecture that the n in (2.6) is an artifact of the proof. This is supported by the experiments shown in Figs. 5 and 8. Before normalization, the system for the experiment plotted in Fig. 8 is defined by a Gaussian matrix \(A \in {\mathbb {R}}^{50{,}000 \times 100}\) and \(\mathbf {b} = \mathbf {e}\), where \(\mathbf {e}\) is a Gaussian vector. Furthermore, Corollary 5.35 of [18] provides a lower bound on the smallest singular value of A with high probability, \({\mathbb {P}}\left( \sigma _{\min }(A) \le \sqrt{m/n} - 1 - t/\sqrt{n}\right) \le 2e^{-t^2/2}.\) That is, asymptotically \(\sigma _{\min }(A)\) is tightly centered around \(\sqrt{m/n} - 1\).
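The concentration of \(\sigma _{\min }(A)\) around \(\sqrt{m/n} - 1\) under this scaling is easy to check numerically; a minimal sketch (dimensions chosen to match the experiments above) is:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50_000, 100
A = rng.standard_normal((m, n)) / np.sqrt(n)           # entries distributed N(0, 1/n)
sigma_min = np.linalg.svd(A, compute_uv=False)[-1]     # smallest singular value
print(sigma_min, np.sqrt(m / n) - 1)                   # observed value vs. sqrt(m/n) - 1
```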

3 Conclusion

We have provided a theoretical analysis for Motzkin’s method for inconsistent systems. We show that by using such a greedy selection strategy, the method exhibits an accelerated convergence rate until a particular threshold is reached. This threshold depends on the dynamic range of the residual, and could be estimated to employ a strategy that yields acceleration without sacrificing convergence accuracy. We provide experiments and concrete analysis for Gaussian systems that support our claims. Future work includes a detailed analysis for other types of relevant systems, theoretical guarantees when estimating the residual, and the study of computational tradeoffs as in the framework of [4]. While Motzkin’s method can be more computationally expensive than RK, understanding the accelerated convergence per iteration will aid in an analysis of computational tradeoffs for the methods in [4]; in addition, it may offer significant advantages in parallel architectures. These are important directions for future work.