1 Introduction

We consider solving large-scale systems of linear equations represented by a matrix \(A\in {\mathbb {R}}^{m\times n}\) and vector \(\mathbf {b}\in {\mathbb {R}}^m\); we use the convention that vectors are bold type, and matrices and scalars are not. We are interested in the highly overdetermined setting, where \(m \gg n\), in which case the system need not have a solution. Iterative solvers like the Kaczmarz method [7, 17], Motzkin’s method [1, 4, 10], and the Gauss-Seidel method [8, 9] have recently regained popularity for such problems since they are particularly efficient in terms of computation and storage.

The Kaczmarz method is a popular iterative solver for overdetermined systems of linear equations and is especially preferred for large-scale systems since it need not ever load the entire system into memory at once. The method consists of sequential orthogonal projections toward the solution set of a single equation (or subsystem). Given the system \(A\mathbf {x} = \mathbf {b}\), the method computes iterates by projecting onto the hyperplane defined by the equation \(\mathbf {a}_i^T \mathbf {x} = b_i\) where \(\mathbf {a}_i^T\) is a selected row of the matrix A and \(b_i\) is the corresponding entry of \(\mathbf {b}\). The iterates are recursively defined as

$$\begin{aligned} \mathbf {x}_{k+1} = \mathbf {x}_k + \frac{b_i - \mathbf {a}_i^T\mathbf {x}_k}{\Vert \mathbf {a}_i\Vert ^2} \mathbf {a}_i \end{aligned}$$
(1.1)

where \(\mathbf {a}_i^T\) is selected from among the rows of A (and the initialization \(\mathbf {x}_0\) is chosen arbitrarily). The seminal work of Strohmer and Vershynin [17] proved exponential convergence for the randomized Kaczmarz method, where the ith row \(\mathbf {a}_i^T\) is chosen with probability \(\Vert \mathbf {a}_i\Vert ^2/\Vert A\Vert _F^2\). Since then many variants of the method have been proposed and analyzed for various types of systems; see, e.g., [3, 5, 6, 13, 14, 15, 19] and references therein.
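For concreteness, a minimal sketch of the randomized Kaczmarz update (1.1) with the row-selection probabilities of [17] is given below; the function name and interface are illustrative only, not taken from the references.

```python
import numpy as np

def randomized_kaczmarz(A, b, num_iters=1000, x0=None, rng=None):
    """Sketch of randomized Kaczmarz: sample row i with probability ||a_i||^2 / ||A||_F^2."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    row_norms_sq = np.einsum('ij,ij->i', A, A)      # ||a_i||^2 for each row
    probs = row_norms_sq / row_norms_sq.sum()       # ||a_i||^2 / ||A||_F^2
    for _ in range(num_iters):
        i = rng.choice(m, p=probs)
        a_i = A[i]
        x += (b[i] - a_i @ x) / row_norms_sq[i] * a_i   # projection step (1.1)
    return x
```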

It is known that the randomized Kaczmarz method converges for inconsistent systems (or equivalently, those corrupted by noise) only up to an error threshold that depends on A and the noise. In [11] it was shown that the iterates of this method satisfy:

$$\begin{aligned} {\mathbb {E}}\Vert \mathbf {x}_k - \mathbf {x}_{\text {LS}}\Vert ^2 \le \left( 1 - \frac{\sigma _{\text {min}}^2(A)}{\Vert A\Vert _F^2}\right) ^k \Vert \mathbf {x}_0 - \mathbf {x}_{\text {LS}}\Vert ^2 + \frac{\Vert A\Vert _F^2}{\sigma _{\text {min}}^2(A)}\Vert \mathbf {e}\Vert _{\infty }^2, \end{aligned}$$
(1.2)

where here and throughout, the norm without subscript, \(\Vert \cdot \Vert \), denotes the Euclidean norm, \(\sigma _{\text {min}}(A)\) denotes the minimum singular value of A, \(\Vert A\Vert _F\) its Frobenius norm, \( \mathbf {x}_{\text {LS}}\) the least squares solution, and \(\mathbf {e} = \mathbf {b} - A \mathbf {x}_{\text {LS}}\) the error term. There are variants of this method that converge to the least squares solution, e.g., [19], which utilizes an additional projection step to project off the error term \(\mathbf {e}\). Additionally, it is known that if a linear system of equations or inequalities is feasible, then randomized Kaczmarz provides a proof or certificate of feasibility, and there are probabilistic guarantees on how quickly it does so [4].

A related but seemingly independently studied approach is that of Agmon [1] and Motzkin and Schoenberg [10], re-imagined a few years later as the now famous perceptron algorithm [16]. These approaches are most often used for feasibility problems, where one seeks a point that resides within some polyhedron described by a system of inequalities; of course, linear systems of equations are one special instance. Additionally, this so-called Motzkin method has been referred to as the Kaczmarz method with the “most violated constraint” or “maximal-residual” control [2, 13, 14]. As these descriptors suggest, this method iterates in a similar fashion to the Kaczmarz method, but rather than selecting a row of A in sequential or randomized order, it selects the row corresponding to the most violated constraint, as described in Algorithm 1. Starting from any initial point \(\mathbf {x}_0\), the method proceeds as follows. If the current point \(\mathbf {x}_k\) is a solution, the method terminates; otherwise there must be a constraint \(\mathbf {a}_i^T \mathbf {x} = b_i\) that is most violated. This constraint defines a hyperplane H. The method then projects \(\mathbf {x}_k\) onto this hyperplane as in (1.1), or perhaps under- or over-projects using an alternate step-size; see [4, 10] for details. Selecting the most violated constraint is intuitive for feasibility problems or for solving consistent linear systems of equations. In the inconsistent case, it may not always make sense to project onto the most violated constraint; see Fig. 1 for a simple example of this situation. However, following [4], we present experimental evidence that suggests Motzkin’s method often offers an initially accelerated convergence rate, both for consistent and inconsistent systems of equations.

Fig. 1

An example of a series of projections using the Motzkin approach on an inconsistent system. Lines represent the hyperplanes consisting of sets \(\{\mathbf {x}: \mathbf {a}_i^T \mathbf {x} = b_i\}\) for rows \(\mathbf {a}_i^T\) of A, and \(\mathbf {x}^*\) denotes the desired solution

Algorithm 1 (Motzkin’s method)
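Since Algorithm 1 appears above only as a figure, we include a minimal sketch of the greedy iteration it describes, assuming unit-norm rows and omitting the under/over-projection step-size of [4, 10]; the naming and the stopping tolerance are ours.

```python
import numpy as np

def motzkin(A, b, num_iters=1000, tol=0.0, x0=None):
    """Sketch of Motzkin's method: project onto the most violated constraint at each iteration."""
    x = np.zeros(A.shape[1]) if x0 is None else np.array(x0, dtype=float)
    for _ in range(num_iters):
        residual = A @ x - b
        i = np.argmax(np.abs(residual))        # index of the most violated constraint
        if np.abs(residual[i]) <= tol:         # terminate if (approximately) solved
            break
        x -= residual[i] * A[i]                # projection (1.1) with ||a_i|| = 1
    return x
```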

1.1 Contribution

We show that Motzkin’s method for systems of linear equations features an initially accelerated convergence rate when the residual has a large dynamic range. We provide bounds for the iterate error which depend on the dynamic range of the residual; these bounds can potentially be used when designing stopping criteria or hybrid approaches. Next, as a concrete example, we show that Gaussian systems of linear equations have large dynamic range and provide bounds on this value. We extend this to a corollary which shows that the initial convergence rate is highly accelerated and that our theoretical bound closely matches experimental evidence.

2 Accelerated convergence of Motzkin’s method

The advantage of the Motzkin method is that by greedily selecting the most violated constraint, the method makes large moves at each iteration, thereby accelerating convergence. One drawback, of course, is that it is computationally expensive to determine which constraint is most violated. For this reason, De Loera et al. [4] proposed a hybrid batched variant of the method that randomly selects a batch of rows and then computes the most violated constraint from that batch. This method is quite fast when using parallel computation, and even without parallelization techniques the accelerated convergence it offers often outweighs the increased computational cost. When the system is inconsistent, however, there is an additional drawback to the Motzkin method, because projecting onto the most violated constraint need not move the iterate closer to the desired solution, as already mentioned and shown in Fig. 1. Our first lemma provides a rule for deciding if a greedy projection offers desirable improvement. Here and throughout, we assume that the matrix A has been normalized to have unit row norms, \(\Vert \mathbf {a}_i\Vert ^2 = 1\), and that the matrix has full column rank, n.

Lemma 2.1

Let \(\mathbf {x}\) denote any desired solution of the system given by matrix A and right hand side \(\mathbf {b}\). If \(\mathbf {e} = A\mathbf {x}- \mathbf {b}\) and \(\Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty > 4\Vert \mathbf {e}\Vert _\infty \) then the next iterate, \(\mathbf {x}_{k+1}\) defined by Algorithm 1 satisfies

$$\begin{aligned} \Vert \mathbf {x}_{k+1} - \mathbf {x}\Vert ^2 \le \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \frac{1}{2} \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty ^2. \end{aligned}$$

Proof

By definition of \(\mathbf {x}_{k+1}\), we have

$$\begin{aligned} \Vert \mathbf {x}_{k+1} - \mathbf {x}\Vert ^2&= \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - 2(\mathbf {a}_{i_{k+1}}^T\mathbf {x}_k-b_{i_{k+1}})(\mathbf {a}_{i_{k+1}}^T\mathbf {x}_k - b_{i_{k+1}} - e_{i_{k+1}})\nonumber \\&\quad + (\mathbf {a}_{i_{k+1}}^T\mathbf {x}_k - b_{i_{k+1}})^2 \nonumber \\&= \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - (\mathbf {a}_{i_{k+1}}^T\mathbf {x}_k - b_{i_{k+1}})^2 + 2(\mathbf {a}_{i_{k+1}}^T\mathbf {x}_k-b_{i_{k+1}})e_{i_{k+1}}\nonumber \\&\le \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - (\mathbf {a}_{i_{k+1}}^T\mathbf {x}_k - b_{i_{k+1}})^2 + 2|\mathbf {a}_{i_{k+1}}^T\mathbf {x}_k-b_{i_{k+1}}|\cdot |e_{i_{k+1}}|\nonumber \\&= \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty ^2 + 2 \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty |e_{i_{k+1}}|\nonumber \\&\le \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty ^2 + 2 \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty \Vert \mathbf {e}\Vert _\infty \nonumber \\&\le \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \frac{1}{2} \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty ^2. \end{aligned}$$
(2.1)

\(\square \)

Note that this tells us that while our residual is still large relative to the error, Motzkin’s method can offer good progress in each iteration. Also, this progress is better than the expected progress offered by Randomized Kaczmarz (RK) when the residual has good dynamic range, in particular when:

$$\begin{aligned} \frac{1}{2} \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty ^2 > \frac{1}{m} \Vert A\mathbf {x}_k - \mathbf {b}\Vert ^2. \end{aligned}$$

We can use Lemma 2.1 to easily obtain the following corollary.

Corollary 2.1

Let \(\mathbf {x}\) denote any desired solution of the system given by matrix A and right hand side \(\mathbf {b}\) and write \(\mathbf {e} = A\mathbf {x}-\mathbf {b}\) as the error term. Then for any given iteration k, the iterate defined by Algorithm 1 satisfies either (i) or both (ii) and (iii), where

$$\begin{aligned}&\text {(i)}\quad \Vert \mathbf {x}_{k+1} - \mathbf {x}\Vert ^2 \le \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \frac{1}{2} \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty ^2\\&\text {(ii)}\quad \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 \le 25m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2\\&\text {(iii)}\quad \Vert \mathbf {x}_{k+1} - \mathbf {x}\Vert ^2 \le \left( 25m\sigma _{\min }^{-2}(A) + 8\right) \Vert \mathbf {e}\Vert _{\infty }^2. \end{aligned}$$

In addition, if the method is run for K iterations with the stopping criterion \(\Vert A\mathbf {x}_K - \mathbf {b}\Vert _\infty \le 4\Vert \mathbf {e}\Vert _\infty \), then the method exhibits the (possibly highly accelerated) convergence rate

$$\begin{aligned} \Vert \mathbf {x}_{K} - \mathbf {x}\Vert ^2&\le \prod _{k=0}^{K-1}\left( 1 - \frac{\sigma _{\min }^2(A)}{4\gamma _k}\right) \cdot \Vert \mathbf {x}_0 - \mathbf {x}\Vert ^2 + 2m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2, \end{aligned}$$
(2.2)
$$\begin{aligned}&\le \left( 1 - \frac{\sigma _{\min }^2(A)}{4m}\right) ^K\Vert \mathbf {x}_0 - \mathbf {x}\Vert ^2 + 2m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2, \end{aligned}$$
(2.3)

with final error satisfying (ii). Here \(\gamma _k\) bounds the dynamic range of the kth residual, \(\gamma _k := \frac{\Vert A\mathbf {x}_k-A\mathbf {x}\Vert ^2}{\Vert A\mathbf {x}_k-A\mathbf {x}\Vert _{\infty }^2}\).

Proof

We consider two cases, depending on whether \(\Vert A\mathbf {x}_k -\mathbf {b}\Vert _\infty > 4\Vert \mathbf {e}\Vert _\infty \) or \(\Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty \le 4\Vert \mathbf {e}\Vert _\infty \). If the former holds, then (i) is valid by Lemma 2.1. If instead the latter holds, then we first obtain (ii) by the simple argument

$$\begin{aligned} \Vert \mathbf {x}_{k} - \mathbf {x}\Vert ^2&\le \sigma _{\min }^{-2}(A)\Vert A\mathbf {x}_k - A\mathbf {x}\Vert ^2\\&\le \sigma _{\min }^{-2}(A)m\Vert A\mathbf {x}_{k} - A\mathbf {x}\Vert _{\infty }^2\\&\le \sigma _{\min }^{-2}(A)m\left( \Vert A\mathbf {x}_{k} - \mathbf {b}\Vert _{\infty }^2 + 2\Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty \Vert \mathbf {e}\Vert _\infty + \Vert \mathbf {e}\Vert _{\infty }^2\right) \\&\le \sigma _{\min }^{-2}(A)m\left( 16\Vert \mathbf {e}\Vert _{\infty }^2 + 8 \Vert \mathbf {e}\Vert _\infty ^2 + \Vert \mathbf {e}\Vert _{\infty }^2\right) \\&= 25m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2. \end{aligned}$$

To obtain (iii) still in this latter case, we continue from (2.1) showing

$$\begin{aligned} \Vert \mathbf {x}_{k+1} - \mathbf {x}\Vert ^2&\le \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty ^2 + 2 \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty \Vert \mathbf {e}\Vert _\infty \\&\le 25m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2 - \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty ^2 + 2 \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty \Vert \mathbf {e}\Vert _\infty \\&\le 25m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2 + 2 \Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty \Vert \mathbf {e}\Vert _\infty \\&\le 25m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2 + 8 \Vert \mathbf {e}\Vert _\infty ^2\\&= \left( 25m\sigma _{\min }^{-2}(A)+ 8 \right) \Vert \mathbf {e}\Vert _\infty ^2. \end{aligned}$$

To prove (2.2) and (2.3), we first note that by choice of stopping criterion, the hypothesis of Lemma 2.1 holds for all \(0\le k \le K-1\), and hence so does (i). Thus for all \(1\le k \le K\), we have

$$\begin{aligned} \Vert \mathbf {x}_{k} - \mathbf {x}\Vert ^2&\le \Vert \mathbf {x}_{k-1} - \mathbf {x}\Vert ^2 - \frac{1}{2}\Vert A\mathbf {x}_{k-1} - \mathbf {b}\Vert _\infty ^2\nonumber \\&= \Vert \mathbf {x}_{k-1} - \mathbf {x}\Vert ^2 - \frac{1}{2}\Vert (A\mathbf {x}_{k-1} - A\mathbf {x}) - \mathbf {e}\Vert _\infty ^2\nonumber \\&\le \Vert \mathbf {x}_{k-1} - \mathbf {x}\Vert ^2 - \frac{1}{4}\Vert A\mathbf {x}_{k-1} - A\mathbf {x}\Vert _\infty ^2 + \frac{1}{2}\Vert \mathbf {e}\Vert _{\infty }^2 \end{aligned}$$
(2.4)
$$\begin{aligned}&= \Vert \mathbf {x}_{k-1} - \mathbf {x}\Vert ^2 - \frac{1}{4\gamma _{k-1}}\Vert A\mathbf {x}_{k-1} - A\mathbf {x}\Vert ^2 + \frac{1}{2}\Vert \mathbf {e}\Vert _{\infty }^2 \nonumber \\&\le \Vert \mathbf {x}_{k-1} - \mathbf {x}\Vert ^2 - \frac{\sigma _{\min }^{2}(A)}{4\gamma _{k-1}}\Vert \mathbf {x}_{k-1} - \mathbf {x}\Vert ^2 + \frac{1}{2}\Vert \mathbf {e}\Vert _{\infty }^2\nonumber \\&= \left( 1 - \frac{\sigma _{\min }^{2}(A)}{4\gamma _{k-1}}\right) \Vert \mathbf {x}_{k-1} - \mathbf {x}\Vert ^2 + \frac{1}{2}\Vert \mathbf {e}\Vert _{\infty }^2, \end{aligned}$$
(2.5)

where the first line follows from (i), the third from Jensen’s inequality, and the fifth from properties of singular values.

Iterating the relation given by (2.5) recursively yields

$$\begin{aligned} \Vert \mathbf {x}_{K} - \mathbf {x}\Vert ^2&\le \prod _{k=0}^{K-1}\left( 1 - \frac{\sigma _{\min }^{2}(A)}{4\gamma _k}\right) \cdot \Vert \mathbf {x}_{0} - \mathbf {x}\Vert ^2 + \sum _{j=0}^{K-1} \prod _{k=0}^{j-1}\left( 1 - \frac{\sigma _{\min }^{2}(A)}{4\gamma _k}\right) \frac{1}{2}\Vert \mathbf {e}\Vert _{\infty }^2\\&\le \prod _{k=0}^{K-1}\left( 1 - \frac{\sigma _{\min }^{2}(A)}{4\gamma _k}\right) \cdot \Vert \mathbf {x}_{0} - \mathbf {x}\Vert ^2 + \sum _{j=0}^{K-1}\left( 1 - \frac{\sigma _{\min }^{2}(A)}{4m}\right) ^j\frac{1}{2}\Vert \mathbf {e}\Vert _{\infty }^2\\&\le \prod _{k=0}^{K-1}\left( 1 - \frac{\sigma _{\min }^{2}(A)}{4\gamma _k}\right) \cdot \Vert \mathbf {x}_{0} - \mathbf {x}\Vert ^2 + 2m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2\\&\le \left( 1 - \frac{\sigma _{\min }^{2}(A)}{4m}\right) ^K\Vert \mathbf {x}_{0} - \mathbf {x}\Vert ^2 + 2m\sigma _{\min }^{-2}(A)\Vert \mathbf {e}\Vert _{\infty }^2, \end{aligned}$$

where the second and fourth inequalities follow from the simple bound \(\gamma _k \le {m}\) and the third by bounding above by the infinite sum. The last two inequalities complete the proof of (2.2) and (2.3). \(\square \)

Note that Lemma 2.1 and Corollary 2.1 are true for any desired solution, \(\mathbf {x}\). Here the desired solution could be the least squares solution or generally any other point. However, the residual of the desired solution, \(A\mathbf {x} - \mathbf {b}\), determines the error \(\mathbf {e}\) and the final error of Motzkin’s method.

Fig. 2

Convergence of Motzkin’s method and RK on correlated system with corresponding theoretical bounds

We note that the convergence rate given by (2.2) yields a significant improvement over that given by (1.2) when the dynamic range of many residuals is large, i.e., when \(\gamma _k \ll m\) for many iterations k. In Fig. 2, we present the convergence of Motzkin and RK, together with the corresponding theoretical bounds, on a random system which before normalization is defined by a matrix \(A \in {\mathbb {R}}^{5000 \times 100}\) with \(a_{ij} \sim {\mathcal {N}}(1,0.5)\) and \(\mathbf {b} = A{\mathbf {1}} + {\varvec{\varepsilon }}\), where \({\mathbf {1}}\) denotes the all ones vector and \({\varvec{\varepsilon }}\) is a Gaussian vector. Figure 3 presents plots of the convergence of Motzkin and RK, and the corresponding theoretical bounds, on systems of equations defined by problems from the Netlib linear programming benchmark set [12]. These problems contain naturally underdetermined systems, which we transform into overdetermined, inconsistent systems with nearly the same least-squares solution. We transform each problem, originally given by the underdetermined system of equations \(A\mathbf {x} = \mathbf {b}\), by adding equations to form

$$\begin{aligned} \begin{bmatrix} A \\ I \end{bmatrix} \mathbf {x} = \begin{bmatrix} \mathbf {b} \\ \mathbf {x}_{LS} + {\varvec{\varepsilon }} \end{bmatrix} \end{aligned}$$

where \(\mathbf {x}_{LS}\) is the least-norm solution of \(A\mathbf {x} = \mathbf {b}\) and \({\varvec{\varepsilon }}\) is a Gaussian vector with small variance, and then normalizing the resulting system. Each problem has a very small error which is distributed relatively uniformly, so there are many iterations in which the theoretical bounds hold. The resulting matrices are of size \(1103 \times 615\) for problem agg, \(1274 \times 758\) for problems agg2 and agg3, and \(777 \times 472\) for problem bandm. These plots show only the iterations before the stopping criterion is met.
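A minimal sketch of this construction is given below; the helper name and the noise level are illustrative, and we assume the rows of the augmented matrix are nonzero so that the normalization is well defined.

```python
import numpy as np

def augment_and_normalize(A, b, noise_std=1e-3, rng=None):
    """Sketch: turn an underdetermined system A x = b into an overdetermined, inconsistent one
    whose least-squares solution is near the least-norm solution of the original system."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    x_ln = np.linalg.lstsq(A, b, rcond=None)[0]          # least-norm solution of A x = b
    A_aug = np.vstack([A, np.eye(n)])                     # stack [A; I]
    b_aug = np.concatenate([b, x_ln + noise_std * rng.standard_normal(n)])
    row_norms = np.linalg.norm(A_aug, axis=1)
    return A_aug / row_norms[:, None], b_aug / row_norms  # normalize each equation
```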

Fig. 3

Convergence of Motzkin’s method and RK, and corresponding theoretical bounds for Netlib linear programming problems. Upper left: agg; upper right: agg2; lower left: agg3; lower right: bandm

In Table 1, we include the CPU computation time (computed with the Matlab function cputime) required to compute an iterate \(\mathbf {x}_k\) with \(\Vert A\mathbf {x}_k-\mathbf {b}\Vert _\infty \le 4 \Vert A\mathbf {x}_{\text {LS}}-\mathbf {b}\Vert _\infty \). We include computation times averaged over 10 trials for Motzkin’s method and the randomized Kaczmarz method on the Netlib problems agg, agg2, agg3, and bandm. Note that this computation is performed with no parallelization implemented for Motzkin’s method, which means that each iteration of Motzkin’s method is much more costly than that of RK. Nevertheless, Motzkin’s method outperforms RK on some of the selected Netlib problems. However, this is not the focus of this paper, as the acceleration described in Lemma 2.1 does not necessarily guarantee Motzkin’s method a computational advantage if its iterations are significantly more costly than those of RK.
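A rough sketch of how such a timing comparison can be set up (here in Python, with process_time standing in for Matlab's cputime, and with the per-iteration update passed in as a callable) is the following; the helper names and interface are ours.

```python
import time
import numpy as np

def motzkin_step(A, b, x):
    """One greedy step of Motzkin's method (unit-norm rows assumed)."""
    r = A @ x - b
    i = np.argmax(np.abs(r))
    return x - r[i] * A[i]

def cpu_time_to_tol(step, A, b, tol, max_iters=10**6):
    """CPU seconds until ||A x_k - b||_inf <= tol, starting from x_0 = 0 (cf. Table 1)."""
    x = np.zeros(A.shape[1])
    start = time.process_time()
    for _ in range(max_iters):
        if np.max(np.abs(A @ x - b)) <= tol:
            break
        x = step(A, b, x)
    return time.process_time() - start
```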

This acceleration is in force until the stopping criterion given in the corollary is met. This bound, therefore, can be used to design such stopping criteria; one could, for example, design an approach that utilizes the Motzkin method until reaching this threshold and then switches to the traditional RK selection strategy to reduce the convergence horizon. In Fig. 4, we see that Motzkin outperforms RK in the initial iterations (while \(\Vert A \mathbf {x}_k - \mathbf {b}\Vert _\infty \gg \Vert \mathbf {e}\Vert _\infty \)) on a system with Gaussian noise. Here, before normalization, the system consists of a Gaussian matrix \(A \in {\mathbb {R}}^{50{,}000 \times 100}\) and right-hand side \(\mathbf {b} = A{\mathbf {1}} + \mathbf {e}\), where \({\mathbf {1}}\) is the vector of all ones and \(\mathbf {e}\) is a Gaussian vector. However, for a system with sparse, large magnitude error, Motzkin does not perform as well in the long run, as it suffers from a worse convergence horizon than RK. Here, before normalization, the system consists of a Gaussian matrix \(A \in {\mathbb {R}}^{50{,}000 \times 100}\) and right-hand side \(\mathbf {b} = A{\mathbf {1}} + 15\sum _{j \in S} \mathbf {e}_j\), where \(\mathbf {e}_j\) denotes the jth coordinate vector and S is a uniform random sample of 50 indices.

Table 1 Average CPU computation times (s) required to compute iterate \(\mathbf {x}_k\) with \(\Vert A\mathbf {x}_k-\mathbf {b}\Vert _\infty \le 4\Vert A\mathbf {x}_{\text {LS}}-\mathbf {b}\Vert _\infty \) for the four Netlib problems, agg, agg2, agg3, and bandm. These values are averaged over 10 trials
Fig. 4

Left: Motzkin’s method versus RK distance from least-squares solution for a Gaussian system with Gaussian noise. Right: Motzkin’s method versus RK distance from least-squares solution for a Gaussian system with sparse, ‘spiky’ noise

To capitalize on this accelerated convergence, one needs knowledge of an upper bound \(\Vert \mathbf {e}\Vert _{\infty } \le \beta \), in which case the stopping criterion of \(\Vert A\mathbf {x}_k - \mathbf {b}\Vert _{\infty } \le 4\beta \) guarantees the accelerated convergence of (2.2) and a final error of \(\Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 \le 25m\sigma _{\min }^{-2}(A)\beta ^2\). Indeed, one quickly verifies that when \(\Vert A\mathbf {x}_k - \mathbf {b}\Vert _{\infty } \le 4\beta \), we have

$$\begin{aligned} \Vert \mathbf {x}_k - \mathbf {x}\Vert&\le \sigma _{\min }^{-1}(A)\Vert A\mathbf {x}_k - A\mathbf {x}\Vert \\&\le \sqrt{m}\sigma _{\min }^{-1}(A)\Vert A\mathbf {x}_k - A\mathbf {x}\Vert _{\infty }\\&\le \sqrt{m}\sigma _{\min }^{-1}(A)\left( \Vert A\mathbf {x}_k - \mathbf {b}\Vert _{\infty } + \Vert \mathbf {e}\Vert _{\infty }\right) \\&\le \sqrt{m}\sigma _{\min }^{-1}(A)\left( 4\beta + \beta \right) \!. \end{aligned}$$
Fig. 5

Left: Average \(\gamma _k\) values for various choices of row dimension, m, of normalized Gaussian \(A \in {{\mathrm{{\mathbb {R}}}}}^{m \times 100}\). Right: An example of the values \(\gamma _k\) for a single run of Motzkin’s method and the corresponding ratio for RK, the matrix is a \(50{,}000 \times 100\) Gaussian. The index \(i_k\) denotes the index chosen in the kth iteration of each method. The horizontal lines denote the values m and \({m/\log (m)}\). We see acceleration when \(\gamma _k < {m}\)

Since the acceleration of the method occurs when many of the terms \(\gamma _k\) are small, we plot an example in Fig. 5. As expected, many terms are bounded away from m. We analyze this further for the Gaussian case below.
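The quantities plotted in Fig. 5 can be tracked, for a synthetic system where the desired solution \(\mathbf {x}\) is known, with instrumentation along the following lines; the function name and loop structure are ours.

```python
import numpy as np

def motzkin_with_gammas(A, b, x_true, num_iters=500):
    """Run Motzkin's method and record gamma_k = ||A(x_k - x)||^2 / ||A(x_k - x)||_inf^2."""
    x = np.zeros(A.shape[1])
    gammas = []
    for _ in range(num_iters):
        d = A @ (x - x_true)
        gammas.append(np.sum(d ** 2) / np.max(np.abs(d)) ** 2)  # dynamic-range term of Corollary 2.1
        r = A @ x - b
        i = np.argmax(np.abs(r))
        x -= r[i] * A[i]                                        # greedy projection, unit-norm rows
    return x, np.array(gammas)                                  # acceleration is present when gammas << m
```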

We also only expect this acceleration to be present while the condition of Lemma 2.1 is in force (i.e., prior to the stopping condition given in the corollary). Once the condition of Lemma 2.1 is no longer satisfied, greedy selection will favor those entries of the residual which have a large contribution from the error, moving the iterate far from the desired solution. While the difference between greedy selection and randomized selection is not so drastic for Gaussian noise, it can be dramatic for a sparse error. We include an example system in Fig. 6 to assist with intuition. Again, one could of course switch to the Kaczmarz approach after an initial use of the Motzkin method as a strategy to gain acceleration without sacrificing convergence horizon. In Fig. 7, we present the convergence of Motzkin’s method, the RK method, and a hybrid method which consists of Motzkin iterations until \(\Vert A\mathbf {x}_k - \mathbf {b}\Vert _{\infty } \le 4\Vert \mathbf {e}\Vert _\infty \), followed by RK iterations; a sketch of this strategy is given below. Again we include results on both a system with Gaussian error and a system with a sparse, ‘spiky’ error, with the systems generated as in Fig. 4.
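A minimal sketch of this hybrid strategy (greedy steps until \(\Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty \le 4\Vert \mathbf {e}\Vert _\infty \), then RK steps) follows; it assumes unit-norm rows, so the RK sampling is uniform, and that \(\Vert \mathbf {e}\Vert _\infty \) (or an upper bound \(\beta \)) is available.

```python
import numpy as np

def hybrid_motzkin_rk(A, b, e_inf, num_iters=2000, rng=None):
    """Sketch: Motzkin (greedy) steps while ||A x - b||_inf > 4 * e_inf, then randomized Kaczmarz."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(num_iters):
        r = A @ x - b
        if np.max(np.abs(r)) > 4 * e_inf:
            i = np.argmax(np.abs(r))       # greedy phase: most violated constraint
        else:
            i = rng.integers(m)            # RK phase: uniform sampling since rows are normalized
        x -= r[i] * A[i]
    return x
```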

Fig. 6

An example of three iterations of Motzkin’s method (\(x_k^M\)) and three iterations of RK (\(x_k^{RK}\)) on a Gaussian system with sparse, ‘spiky’ error. More of the RK iterations are near the least squares solution while Motzkin consistently selects the corrupted equation

Fig. 7

Left: Motzkin’s method, RK, and hybrid distance from least-squares solution for a Gaussian system with Gaussian noise. Right: Motzkin’s method, RK, and hybrid distance from least-squares solution for a Gaussian system with sparse, ‘spiky’ noise

2.1 Heuristics for the Gaussian case

Here, we study heuristics for our convergence results in the Gaussian matrix case. Note that our results hold for matrices with normalized rows. For simplicity, however, we will consider an \(m\times n\) matrix whose entries are i.i.d. Gaussian with mean 0 and variance 1/n. We will then assume we are in the asymptotic regime where this distribution approximates a Gaussian matrix with normalized unit-norm rows. To that end, we assume m and n both grow linearly with respect to one another, and that they are both substantially large.

Define \(I_k\) to be the set of rows of A that are independent of \(\mathbf {x}_k\) and note that \(I_k \subseteq I_{k-1} \subseteq \cdots \subseteq I_1 \subseteq I_0 = [m]\). Fix iteration k and define \(m' = |I_k|\). Note that \(m-k \le m' \le m\) is the number of rows of the sub-matrix whose rows are independent of the iterates up to iteration k. Throughout this section \({\mathbb {P}}\) and \({\mathbb {E}}\) refer to probability and expectation taken with respect to the random and unsampled portion of the matrix A, \(A_{I_k}\), which has \(m'\) rows.

Our first lemma gives a bound on the expected dynamic range for a Gaussian matrix.

Lemma 2.2

If \(A \in {{\mathrm{{\mathbb {R}}}}}^{m \times n}\) is a Gaussian matrix with \(a_{ij} \sim {\mathcal {N}}(0,1/n)\) and \(\mathbf {x}\) is independent of at least \(m'\) rows of A (e.g. constructed via k iterations of Motzkin’s method) then

$$\begin{aligned} \frac{{\mathbb {E}}\Vert A\mathbf {x}\Vert ^2}{{\mathbb {E}}\Vert A\mathbf {x}\Vert _\infty ^2} \lesssim \frac{n(m' + \sum _{i \not \in I_k} \Vert \mathbf {a}_i\Vert ^2)}{\log (m')}. \end{aligned}$$

Proof

First note that

$$\begin{aligned} {\mathbb {E}}(\sum _{i=1}^m (\mathbf {a}_i^T\mathbf {x})^2)&= \sum _{i=1}^m {\mathbb {E}} (\mathbf {a}_i^T\mathbf {x})^2\\&\le \sum _{i=1}^m {\mathbb {E}}(\Vert \mathbf {a}_i\Vert ^2 \Vert \mathbf {x}\Vert ^2)&\text {by Cauchy-Schwarz}\\&\le \sum _{i \in I_k} {\mathbb {E}}(\Vert \mathbf {a}_i\Vert ^2 \Vert \mathbf {x}\Vert ^2) + \sum _{i \not \in I_k} \Vert \mathbf {a}_i\Vert ^2 \Vert \mathbf {x}\Vert ^2\\&= (m' + \sum _{i \not \in I_k} \Vert \mathbf {a}_i\Vert ^2) \Vert \mathbf {x}\Vert ^2. \end{aligned}$$

Next, note that if \(\mathbf {a}_i\) and \(\mathbf {x}\) are independent then \(\mathbf {a}_i^T\mathbf {x} \sim {\mathcal {N}}(0,\Vert \mathbf {x}\Vert ^2/n)\). Then

$$\begin{aligned} {\mathbb {E}}(\max _{i \in [m]} (\mathbf {a}_i^T\mathbf {x})^2)&\ge {\mathbb {E}}(\max _{i \in I_k} (\mathbf {a}_i^T\mathbf {x})^2)\\&\ge {\mathbb {E}}(\max _{i \in I_k} \mathbf {a}_i^T\mathbf {x})^2\\&\ge ({\mathbb {E}}\max _{i \in I_k} \mathbf {a}_i^T\mathbf {x})^2&\text {by Jensen's inequality}\\&\ge \frac{c\Vert \mathbf {x}\Vert ^2\log (m')}{n}, \end{aligned}$$

as it is commonly known that \({\mathbb {E}}(\max _{i \in [N]} X_i) \ge c \sigma \sqrt{\log N}\) for independent \(X_i \sim {\mathcal {N}}(0,\sigma ^2)\). Thus, we have

$$\begin{aligned} {\mathbb {E}}\Vert A\mathbf {x}\Vert _\infty ^2 \ge \frac{c \Vert \mathbf {x}\Vert ^2 \log (m')}{n} \ge c \frac{\log (m')}{n(m' + \sum _{i \not \in I_k}\Vert \mathbf {a}_i\Vert ^2)}{\mathbb {E}}\Vert A\mathbf {x}\Vert ^2. \end{aligned}$$

\(\square \)
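A quick numerical sanity check of Lemma 2.2 in the simplest case, where \(\mathbf {x}\) is independent of every row (so \(m' = m\) and the sum over \(i \not \in I_k\) is empty), might look as follows; the absolute constant hidden by \(\lesssim \) is not specified by the lemma, so only the order of magnitude of the comparison is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, trials = 50_000, 100, 10

num, den = 0.0, 0.0
for _ in range(trials):
    A = rng.standard_normal((m, n)) / np.sqrt(n)   # a_ij ~ N(0, 1/n)
    x = rng.standard_normal(n)                     # independent of every row of A
    Ax = A @ x
    num += np.sum(Ax ** 2)                         # accumulates an estimate of E||Ax||^2
    den += np.max(np.abs(Ax)) ** 2                 # accumulates an estimate of E||Ax||_inf^2

print(num / den, n * m / np.log(m))                # observed ratio vs. the Lemma 2.2 bound with m' = m
```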

We can use this lemma along with our main result to obtain the following.

Corollary 2.2

Let \(A \in {{\mathrm{{\mathbb {R}}}}}^{m \times n}\) be a normalized Gaussian matrix as described previously, let \(\mathbf {x}\) denote the desired solution of the system given by matrix A and right hand side \(\mathbf {b}\), write \(\mathbf {e} = A\mathbf {x}-\mathbf {b}\) as the error term, and assume \(\mathbf {x}_0\) is chosen so that \(\mathbf {x}_0 - \mathbf {x}\) is independent of the rows of A, \(\mathbf {a}_i^T\). If Algorithm 1 is run with stopping criterion \(\Vert A\mathbf {x}_k - \mathbf {b}\Vert _\infty \le 4\Vert \mathbf {e}\Vert _\infty \), then in expectation the method exhibits the accelerated convergence rate

$$\begin{aligned} {\mathbb {E}} \Vert \mathbf {x}_{k+1}-\mathbf {x}\Vert ^2 \lesssim {\mathbb {E}}\Bigg [\Bigg (1 - \frac{\log (m') \sigma _{\min }^2(A)}{4nm}\Bigg ) \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 + \frac{1}{2}\Vert \mathbf {e}\Vert _\infty ^2\Bigg ]. \end{aligned}$$
(2.6)
Fig. 8

Convergence of Motzkin’s method and RK on Gaussian system with corresponding theoretical rate for RK and conjectured rate for Motzkin’s method

Proof

Beginning from line (2.4) of the proof of Corollary 2.1 and taking expectation of both sides, we have

$$\begin{aligned} {\mathbb {E}}\Vert \mathbf {x}_{k+1}-\mathbf {x}\Vert ^2&\le {\mathbb {E}} \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \frac{1}{4}{\mathbb {E}} \Vert A(\mathbf {x}_k-\mathbf {x})\Vert _\infty ^2 + \frac{1}{2}{\mathbb {E}}\Vert \mathbf {e}\Vert _\infty ^2\\&\lesssim {\mathbb {E}} \Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \frac{\log (m')}{4n(m'+\sum _{i \not \in I_k}\Vert \mathbf {a}_i\Vert ^2)}{\mathbb {E}} \Vert A(\mathbf {x}_k-\mathbf {x})\Vert ^2 + \frac{1}{2}{\mathbb {E}}\Vert \mathbf {e}\Vert _\infty ^2\\&= {\mathbb {E}} \Bigg [\Vert \mathbf {x}_k - \mathbf {x}\Vert ^2 - \frac{ \log (m')}{4nm} \Vert A\mathbf {x}_k-A\mathbf {x}\Vert ^2 + \frac{1}{2}\Vert \mathbf {e}\Vert _\infty ^2\Bigg ]\\&\le {\mathbb {E}} \Bigg [\Bigg (1 - \frac{\log (m')\sigma _{\min }^2(A)}{4nm}\Bigg ) \Vert \mathbf {x}_k -\mathbf {x}\Vert ^2 + \frac{1}{2}\Vert \mathbf {e}\Vert _\infty ^2\Bigg ] \end{aligned}$$

where the second inequality follows from Lemma 2.2 and the fourth from properties of singular values. \(\square \)

This corollary implies a logarithmic improvement in the convergence rate when \(n \ll \log (m')\), at least initially. Of course, we conjecture that the \(\log (m')\) term in (2.6) is an artifact of the proof and could actually be replaced with \(\log (m)\). Additionally, we conjecture that the n in (2.6) is an artifact of the proof. This is supported by the experiments shown in Figs. 5 and 8. Before normalization, the system for the experiment plotted in Fig. 8 is defined by a Gaussian matrix \(A \in {\mathbb {R}}^{50{,}000 \times 100}\) and \(\mathbf {b} = \mathbf {e}\), where \(\mathbf {e}\) is a Gaussian vector. Furthermore, Corollary 5.35 of [18] provides a lower bound on the smallest singular value of A with high probability, \({\mathbb {P}}\left( \sigma _{\min }(A) \le \sqrt{m/n} - 1 - t/\sqrt{n}\right) \le 2e^{-t^2/2}.\) That is, asymptotically \(\sigma _{\min }(A)\) is tightly centered around \(\sqrt{m/n} - 1\).
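The concentration of \(\sigma _{\min }(A)\) around \(\sqrt{m/n} - 1\) under this scaling is easy to check numerically; a minimal sketch (dimensions chosen to match the experiments above) is:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50_000, 100
A = rng.standard_normal((m, n)) / np.sqrt(n)           # entries distributed N(0, 1/n)
sigma_min = np.linalg.svd(A, compute_uv=False)[-1]     # smallest singular value
print(sigma_min, np.sqrt(m / n) - 1)                   # observed value vs. sqrt(m/n) - 1
```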

3 Conclusion

We have provided a theoretical analysis for Motzkin’s method for inconsistent systems. We show that by using such a greedy selection strategy, the method exhibits an accelerated convergence rate until a particular threshold is reached. This threshold depends on the dynamic range of the residual, and could be estimated to employ a strategy that yields acceleration without sacrificing convergence accuracy. We provide experiments and concrete analysis for Gaussian systems that support our claims. Future work includes a detailed analysis for other types of relevant systems, theoretical guarantees when estimating the residual, and the study of computational tradeoffs as in the framework of [4]. While Motzkin’s method can be more computationally expensive than RK, understanding the accelerated convergence per iteration will aid in an analysis of computational tradeoffs for the methods in [4]; in addition, it may offer significant advantages in parallel architectures. These are important directions for future work.