1 Introduction

Perhaps the most fundamental problem in linear algebra and numerical mathematics is to find solutions of linear systems of the form

$$\begin{aligned} Ax=b, \end{aligned}$$
(1)

with \(A\in {\mathbb {R}}^{m\times n}\) and \(b\in {\mathbb {R}}^m\) given. Among the many existing iterative methods, we focus on Kaczmarz-type methods. Denote the rows of A by \(a_1^{\mathrm{T}},\ldots , a_m^{\mathrm{T}}\in {\mathbb {R}}^n\) and the entries of b by \(b_1,\ldots , b_m\in {\mathbb {R}}\); then the original Kaczmarz method (Kaczmarz 1937) selects the i-th row cyclically and updates \(x_k\) via

$$\begin{aligned} x_{k+1}=x_k-\frac{\langle a_i,x_k\rangle -b_i}{\Vert a_i\Vert _2^2}a_i, \end{aligned}$$
(2)

which means that \(x_{k+1}\) is computed as the orthogonal projection of \(x_k\) onto the i-th hyperplane \(H(a_i,b_i):=\{x\in {\mathbb {R}}^n|~\langle a_i,x\rangle =b_i\}\). The Kaczmarz method has recently regained popularity due to the seminal work (Strohmer and Vershynin 2009), where the first elegant convergence rate was obtained by analyzing a randomized variant. Specifically, the randomized Kaczmarz (RK) method selects the projection hyperplane with probability \(p_i=\frac{\Vert a_i\Vert _2^2}{\Vert A\Vert _F^2}\) and converges in expectation to the least-norm solution \({\hat{x}}\) of \(Ax=b\) at a linear rate

$$\begin{aligned} {\mathbb {E}}\Vert x_{k+1}-{\hat{x}}\Vert _2^2\le \left( 1-\frac{1}{\kappa ^2}\right) {\mathbb {E}}\Vert x_k-{\hat{x}}\Vert _2^2, \end{aligned}$$
(3)

where \(\kappa =\frac{\Vert A\Vert _F}{\sigma _{\min }(A)}\), \(\Vert \cdot \Vert _F\) is the Frobenius norm, and \(\sigma _{\min }(A)\) is the smallest non-zero singular value of A. To find sparse solutions of \(Ax=b\) instead of least-norm solutions, the randomized sparse Kaczmarz (RaSK) method was introduced in Lorenz et al. (2014b, 2014a) and Petra (2015), with the following scheme:

$$\begin{aligned} x^*_{k+1}&=x^*_k-\frac{\langle a_i,x_k\rangle -b_i}{\Vert a_i\Vert _2^2}a_i, \end{aligned}$$
(4a)
$$\begin{aligned} x_{k+1}&=S_{\lambda }\left( x^*_{k+1}\right) , \end{aligned}$$
(4b)

where the index i is sampled according to the same rule as in RK and \(S_{\lambda }(x):=\max \{|x|-\lambda ,0\}\cdot \mathrm{sign}(x)\) is the soft shrinkage operator with parameter \(\lambda \ge 0\). It is not hard to see that if the parameter \(\lambda \) equals zero, then the operator \(S_\lambda \) reduces to the identity and hence RaSK reduces to RK; in this sense, RaSK generalizes RK. Theoretically, it was shown in Schöpfer and Lorenz (2019) that the sequence \(\{x_{k}\}\) generated by RaSK converges linearly in expectation to the unique solution \({\hat{x}}\) of the augmented basis pursuit problem (Schöpfer 2012; Chen et al. 2001; Elad 2010)

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n}\lambda \Vert x\Vert _{1}+\frac{1}{2}\Vert x\Vert _2^2,~ {\text{ subject } \text{ to }}~ Ax=b. \end{aligned}$$
(5)
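For illustration only, one RaSK run on a small synthetic consistent system may be sketched in MATLAB (the language used in our experiments later) as follows; the dimensions, sparsity level and parameter \(\lambda \) below are arbitrary placeholder choices, not those of Sect. 5.

    m = 50; n = 20; lambda = 1;
    A = randn(m, n);
    xhat = zeros(n, 1); xhat(randperm(n, 3)) = randn(3, 1);  % a sparse solution
    b = A * xhat;                                            % consistent right-hand side
    S = @(x) max(abs(x) - lambda, 0) .* sign(x);             % soft shrinkage S_lambda
    x = zeros(n, 1); xs = zeros(n, 1);                       % primal iterate x_k and dual iterate x_k^*
    prob = sum(A.^2, 2) / norm(A, 'fro')^2;                  % RK/RaSK probabilities p_i
    for k = 1:5000
        i  = find(rand <= cumsum(prob), 1);                  % sample row i with probability p_i
        ai = A(i, :)';
        xs = xs - ((ai' * x - b(i)) / (ai' * ai)) * ai;      % step (4a)
        x  = S(xs);                                          % step (4b)
    end
    fprintf('relative residual = %.2e\n', norm(A * x - b) / norm(b));

Setting lambda to zero in this sketch recovers the RK iteration (2).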

Let \(\mathrm{supp}({\hat{x}})=\{j\in \{1,\ldots ,n\}|{\hat{x}}_j\ne 0\}\) and \(A_J\) represent the columns of A indexed by J; define

$$\begin{aligned} {\widetilde{\sigma }}_{\min }(A)=\min \{\sigma _{\min }(A_J)|J\subset \{1,\ldots ,n\},A_J\ne 0\}. \end{aligned}$$
(6)

Let \({\widetilde{\kappa }}=\frac{\Vert A\Vert _F}{{\widetilde{\sigma }}_{\min }(A)}\). Since \(b\ne 0\) implies \({\hat{x}}\ne 0\), we may define

$$\begin{aligned} |{\hat{x}}|_{\min }=\min \{|{\hat{x}}_j||j\in \mathrm{supp}({\hat{x}})\}>0. \end{aligned}$$
(7)

With the above notions, the linear convergence rate of RaSK established in Schöpfer and Lorenz (2019) reads

$$\begin{aligned} {\mathbb {E}}\Vert x_{k+1}-{\hat{x}}\Vert _2^2\le \left( 1- \frac{1}{2}\cdot \frac{1}{{\widetilde{\kappa }}^2}\cdot \frac{|{\hat{x}}|_{\min }}{|{\hat{x}}|_{\min }+2\lambda }\right) {\mathbb {E}}\Vert x_k-{\hat{x}}\Vert _2^2. \end{aligned}$$
(8)

Kaczmarz-type methods use only a single row of the matrix in each iteration. They are widely applied to solve large-scale linear systems arising in various settings, for example, tensor recovery (Chen and Qin 2021; Du and Sun 2021; Wang et al. 2022), compressed sensing (Lorenz et al. 2014a), phase retrieval (Tan and Vershynin 2019) and so on. Owing to this wide range of applications, more advanced sampling rules have been considered, such as those in Zouzias and Freris (2013), Jiang et al. (2020), Yuan et al. (2022a, 2022b), Steinerberger (2021) and Li and Liu (2022), to improve the convergence rate of RK or RaSK. For example, in our recent paper (Yuan et al. 2022a), we introduced a sparse sampling Kaczmarz–Motzkin method, which essentially combines random and greedy ideas (Feichtinger et al. 1992). Very recently, a weighted sampling rule, which selects the i-th row with probability proportional to \(|\langle a_i,x_k\rangle -b_i|^p/\Vert a_i\Vert _2^p\) with \(0<p<\infty \), was proposed in Jiang et al. (2020) and Steinerberger (2021) to accelerate the RK method; more importantly, this rule can be used to explain why the greedy idea (also called the maximal correction method) works well. A natural question arises: can the weighted sampling rule be adopted to speed up RaSK? In this study, we answer this question affirmatively. Specifically, we show that the RaSK method equipped with the weighted sampling rule (WRaSK) converges linearly in expectation in the sense that

$$\begin{aligned} {\mathbb {E}}\Vert x_{k}-{\hat{x}}\Vert _2^2\le \left( 1-\frac{1}{2} {\widetilde{\sigma }}^2_{\mathrm{min}}(A)\frac{|{\hat{x}}|_{\mathrm{min}}}{|{\hat{x}}|_{\mathrm{min}}+2\lambda } \mathop {\inf }\limits _{z\ne 0}\frac{\Vert Az\Vert _{l_{p+2}}^{p+2}}{\Vert Az\Vert _{l_{p}}^{p}\Vert Az\Vert _2^{2}}\right) ^{k}\left( 2\lambda \Vert {\hat{x}}\Vert _1+\Vert {\hat{x}}\Vert _2^2\right) , \end{aligned}$$
(9)

where A is normalized to \(\Vert a_i\Vert _2=1\) for each row and \({\hat{x}}\) is the unique solution of (5). By showing

$$\begin{aligned} \mathop {\inf }\limits _{z\ne 0}\frac{\Vert Az\Vert _{l_{p+2}}^{p+2}}{\Vert Az\Vert _{l_{p}}^{p}\Vert Az\Vert _2^{2}}\ge \frac{1}{m}=\frac{1}{\Vert A\Vert _F^2}, \end{aligned}$$
(10)

we conclude that the new linear convergence rate is at least as good as that of RaSK; more details are presented in Theorem 1. On the other hand, WRaSK reduces to RaSK when \(p=0\) and approximates the maximal correction variant of RaSK as \(p\rightarrow \infty \); the latter also explains why our previously proposed sparse sampling Kaczmarz–Motzkin method in Yuan et al. (2022a) works well. Numerically, we demonstrate the superiority of WRaSK through a series of experiments.
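Although (10) is established analytically (see Theorem 1), the inequality can be checked numerically on random directions z. The following MATLAB sketch, with arbitrary dimensions and exponent p, only verifies the bound empirically and does not compute the infimum.

    m = 60; n = 25; p = 4;
    A = randn(m, n);
    A = A ./ sqrt(sum(A.^2, 2));          % normalize each row: ||a_i||_2 = 1
    worst = inf;
    for t = 1:1e4
        y = A * randn(n, 1);              % y = A z for a random direction z
        ratio = sum(abs(y).^(p + 2)) / (sum(abs(y).^p) * sum(y.^2));
        worst = min(worst, ratio);
    end
    fprintf('smallest observed ratio = %.4f, lower bound 1/m = %.4f\n', worst, 1/m);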

The paper is organized as follows. In Sect. 2, we recall basic facts about the Bregman distance and the Bregman projection. In Sect. 3, we propose the WRaSK method and prove its linear convergence in both the noiseless and noisy cases. Some detailed remarks about WRaSK are given in Sect. 4. In Sect. 5, we carry out numerical experiments to verify the superiority of WRaSK. Section 6 concludes the paper.

2 Preliminaries

First, we recall some basic knowledge about convex analysis (Nesterov 2003; Rockafellar and Wets 2009).

2.1 Convex analysis tool

Definition 1

Let \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) be a convex function. The subdifferential of f at \(x\in {\mathbb {R}}^n\) is

$$\begin{aligned} \partial f(x):=\{x^*\in {\mathbb {R}}^n|f(y)\ge f(x)+\langle x^*,y-x\rangle , \forall y\in {\mathbb {R}}^n\}. \end{aligned}$$

If the convex function f is assumed to be differentiable, then \(\partial f(x)=\{\nabla f(x)\}.\)

Definition 2

We say that a convex function \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) is \(\alpha \)-strongly convex if there exists \(\alpha >0\) such that for all \(x,y\in {\mathbb {R}}^n\) and \(x^*\in \partial f(x)\), we have

$$\begin{aligned} f(y)\ge f(x)+\langle x^*,y-x\rangle +\frac{\alpha }{2}\Vert y-x\Vert ^2. \end{aligned}$$

The conjugate function is \(f^*(x^*)=\sup _{x\in {\mathbb {R}}^{n}}\{\langle x^*,x\rangle -f(x)\}\).

For any convex function f, the function \(f(\cdot )+\frac{\lambda }{2}\Vert \cdot \Vert _2^2\) is \(\lambda \)-strongly convex whenever \(\lambda >0\).

Example 1

(Schöpfer and Lorenz 2019) For the augmented \(\ell _1\)-norm \(f(x)=\lambda \Vert x\Vert _1+\frac{1}{2}\Vert x\Vert _2^2\), its subdifferential is given by \(\partial f(x)=\{x+\lambda \cdot s~|~s_i=\mathrm{sign}(x_i)\) if \(x_i\ne 0\), and \(s_i\in [-1,1]\) if \(x_i=0\}.\) Its conjugate function is \(f^*(x)=\frac{1}{2}\Vert S_{\lambda }(x)\Vert _2^2\), and moreover \(\nabla f^*(x)=S_{\lambda }(x).\)
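As a quick numerical illustration of Example 1, the identity \(\nabla f^*(x)=S_{\lambda }(x)\) can be checked by central finite differences; the MATLAB sketch below uses an arbitrary \(\lambda \) and a random point.

    lambda = 0.7;
    S  = @(x) max(abs(x) - lambda, 0) .* sign(x);   % soft shrinkage S_lambda
    fs = @(x) 0.5 * norm(S(x))^2;                   % conjugate function f^*
    x  = randn(6, 1);
    h  = 1e-6; g = zeros(6, 1);
    for j = 1:6
        e = zeros(6, 1); e(j) = h;
        g(j) = (fs(x + e) - fs(x - e)) / (2 * h);   % central finite difference
    end
    disp(max(abs(g - S(x))));                       % should be close to zero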

2.2 The Bregman distance

Definition 3

(Lorenz et al. 2014b) Let \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) be a convex function. The Bregman distance between \(x,y \in {\mathbb {R}}^n\) with respect to f and \(x^*\in \partial f(x)\) is defined as

$$\begin{aligned} D_f^{x^*}(x,y):=f(y)-f(x)-\langle x^*,y-x\rangle . \end{aligned}$$

Example 2

  (a) When \(f(x)=\frac{1}{2}\Vert x\Vert _2^2\), we have \(\partial f(x)=\{\nabla f(x)\}\) and \(D_f^{x^*}(x,y)=\frac{1}{2}\Vert y-x\Vert _2^2.\)

  (b) When \(f(x)=\lambda \Vert x\Vert _1+\frac{1}{2}\Vert x\Vert _2^2\), we have \(x^*=x+\lambda \cdot s\in \partial f(x)\) and \(D_f^{x^*}(x,y)=\frac{1}{2}\Vert y-x\Vert _2^2+\lambda (\Vert y\Vert _1-\langle s,y\rangle ),\) where s is defined in Example 1 (a numerical check is sketched below).
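As a simple check of Example 2(b), the Bregman distance computed directly from Definition 3 can be compared with the closed form; the MATLAB sketch below uses random points and \(\lambda =1\).

    lambda = 1;
    f  = @(x) lambda * norm(x, 1) + 0.5 * norm(x)^2;
    x  = randn(5, 1); y = randn(5, 1);
    s  = sign(x);                        % valid since x_i ~= 0 almost surely
    xs = x + lambda * s;                 % a subgradient x^* of f at x
    D_def    = f(y) - f(x) - xs' * (y - x);                          % Definition 3
    D_closed = 0.5 * norm(y - x)^2 + lambda * (norm(y, 1) - s' * y); % Example 2(b)
    disp(abs(D_def - D_closed));         % should be of the order of machine precision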

The following result will be used in the forthcoming convergence analysis of WRaSK.

Lemma 1

(Schöpfer and Lorenz 2019) Let \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) be \(\alpha \)-strongly convex. Then for all \(x,y\in {\mathbb {R}}^n\) and \(x^*\in \partial f(x), y^*\in \partial f(y)\), we have

$$\begin{aligned} \frac{\alpha }{2}\Vert x-y\Vert _2^2\le D_f^{x^*}(x,y)\le \langle x^*-y^*,x-y\rangle \le \Vert x^*-y^*\Vert _2\cdot \Vert x-y\Vert _2 \end{aligned}$$

and hence \(D_f^{x^*}(x,y)=0 \Leftrightarrow x=y.\)

2.3 The Bregman projection

Definition 4

(Lorenz et al. 2014b) Let \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) be \(\alpha \)-strongly convex and \(C\subset {\mathbb {R}}^n\) be a nonempty closed convex set. The Bregman projection of x onto C with respect to f and \( x^*\in \partial f(x)\) is the unique point \(\varPi _{C}^{x^*}(x)\in C\) such that

$$\begin{aligned} D^{x^*}_f(x,\varPi _{C}^{x^*}(x))=\min _{y\in C}D^{x^*}_f(x,y). \end{aligned}$$

The Bregman projection generalizes the traditional orthogonal projection: when \(f(x)=\frac{1}{2}\Vert x\Vert _2^2\), it reduces to the orthogonal projection, denoted by \(P_{C}(x)\) as usual. The next lemma shows how to compute the Bregman projection onto hyperplanes.

Lemma 2

(Lorenz et al. 2014b) Let \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) be \(\alpha \)-strongly convex, \(a\in {\mathbb {R}}^n\), \(\beta \in {\mathbb {R}}.\) The Bregman projection of \(x\in {\mathbb {R}}^n\) onto the hyperplane \(H(a,\beta )=\{x\in {\mathbb {R}}^n|\langle a,x\rangle =\beta \}\) with \(a\ne 0\) is

$$\begin{aligned} z:=\varPi _{H(a,\beta )}^{x^*}(x)=\nabla f^*(x^*-a{\hat{t}}), \end{aligned}$$

where \({\hat{t}}\in {\mathbb {R}}\) is a solution of \(\min _{t\in {\mathbb {R}}}f^*(x^*-at)+t\beta .\) Moreover, \(z^*=x^*-{\hat{t}}a\) is an admissible subgradient for z and for all \(y\in H(a,\beta )\) we have

$$\begin{aligned} D^{z^*}_f(z,y)\le D^{x^*}_f(x,y)-\frac{\alpha }{2}\frac{(\langle a,x\rangle -\beta )^2}{\Vert a\Vert _2^2}. \end{aligned}$$
(11)
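As an illustration of Lemma 2 for the augmented \(\ell _1\)-norm of Example 1, the one-dimensional problem in t can be solved generically; the MATLAB sketch below uses fminsearch purely for illustration, whereas the actual method later uses an exact piecewise-quadratic solve as in (12).

    lambda = 1; n = 8;
    S  = @(x) max(abs(x) - lambda, 0) .* sign(x);         % soft shrinkage, so grad f^* = S (Example 1)
    fs = @(x) 0.5 * norm(S(x))^2;                         % conjugate f^*
    a  = randn(n, 1); beta = randn;                       % hyperplane H(a, beta)
    x  = randn(n, 1); xs = x + lambda * sign(x);          % point x and a subgradient x^*
    that = fminsearch(@(t) fs(xs - t * a) + t * beta, 0); % 1-D problem of Lemma 2 (generic solver)
    z  = S(xs - that * a);                                % Bregman projection of x onto H(a, beta)
    disp(abs(a' * z - beta));                             % feasibility check: should be small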

3 Weighted randomized sparse Kaczmarz method

In this section, we will introduce the WRaSK method to solve the augmented basis pursuit problem (5). We assume that \(b\ne 0\) belongs to \(\mathcal {R}(A)\) so that (5) has a unique solution \({\hat{x}}\ne 0\). Without loss of generality, we normalize each row of the matrix A to \(\Vert a_i\Vert _2=1\).

3.1 The sampling rule

Recall that we aim to apply the weighted sampling rule to RaSK, so let us first describe this rule. In contrast to the uniform sampling used in RK and RaSK (recall that \(\Vert a_i\Vert _2=1\)), the weighted sampling rule assigns a larger probability to rows with larger residuals. To achieve this, given the current iterate \(x_k\), we compute the residuals \(|\langle a_i,x_k\rangle -b_i|,i=1,\ldots ,m\). A small residual \(|\langle a_i,x_k\rangle -b_i|\) means that \(x_k\) approximately solves the equation \(\langle a_i,x\rangle =b_i\), so we should try to correct the equations with large residuals. The weighted sampling rule selects the i-th row with probability proportional to \(|\langle a_i,x_k\rangle -b_i|^p\) with \(0<p<\infty \); hence, rows with larger residuals are more likely to be selected. It should be noted that the weighted sampling rule shares a similar greedy idea with the maximal correction method, which selects the row with the largest residual in a deterministic rather than random way.

With the discussion above, WRaSK is obtained simply by running the scheme (4) with the index i chosen with probability \(p_i=\frac{|\langle a_i,x_k\rangle -b_i|^p}{\Vert Ax_k-b\Vert _{l_p}^p}\); a sketch of this sampling step is given below.
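A minimal MATLAB sketch of the weighted sampling step, with placeholder data and an arbitrary exponent p, reads as follows.

    m = 40; n = 15; p = 4;                        % p is the weighting exponent
    A = randn(m, n); A = A ./ sqrt(sum(A.^2, 2)); % row-normalized matrix
    xhat = zeros(n, 1); xhat(1:3) = randn(3, 1);
    b = A * xhat;                                 % consistent right-hand side
    xk = zeros(n, 1);                             % current iterate x_k
    r  = abs(A * xk - b);                         % residuals |<a_i, x_k> - b_i|
    prob = r.^p / sum(r.^p);                      % p_i proportional to the p-th power of the residual
    i = find(rand <= cumsum(prob), 1);            % sampled row index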

3.2 The WRaSK method

In this part, we formally introduce the weighted randomized sparse Kaczmarz (WRaSK) method, which combines the randomized sparse Kaczmarz method with the weighted sampling rule and thus inherits the advantages of both. Note that the iterate \(x_k\) generated by WRaSK is projected onto the selected hyperplane by a Bregman projection instead of an orthogonal projection; using the augmented \(\ell _1\)-norm function, this exploits the Bregman distance to obtain sparse solutions of linear systems. The WRaSK method comes in two variants, with an inexact step and with an exact step, abbreviated as WRaSK and EWRaSK, respectively. In the inexact-step variant, \(t_k=\langle a_{i_k},x_k\rangle -b_{i_k}\), which can be seen as a relaxation of the Bregman projection. In the exact-step variant, we need to solve the piecewise quadratic minimization problem

$$\begin{aligned} t_k=\arg \min _{t\in {\mathbb {R}}}\frac{1}{2}\Vert S_{\lambda }(x^*-a_{i_k}t)\Vert _2^2+tb_{i_k}. \end{aligned}$$
(12)

We remark that the computational cost of WRaSK per iteration is lower than that of EWRaSK due to the different step-size selections; this point will be further explained in Sect. 4.1. The stopping criterion is that the maximum number of iterations or the allowable tolerance is reached.

Algorithm 1 The weighted randomized sparse Kaczmarz method (WRaSK/EWRaSK); pseudocode not reproduced here
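Since the pseudocode of Algorithm 1 is not reproduced here, the following MATLAB sketch gives one possible implementation along the lines of the description above; the data are synthetic placeholders, the flag exact_step switches between the inexact step (WRaSK) and a generic numerical solve of (12) emulating the exact step of EWRaSK, and details may differ from Algorithm 1.

    m = 100; n = 40; lambda = 1; p = 4; exact_step = false;   % exact_step = true emulates EWRaSK
    A = randn(m, n); A = A ./ sqrt(sum(A.^2, 2));             % row-normalized matrix
    xhat = zeros(n, 1); xhat(randperm(n, 5)) = randn(5, 1);
    b = A * xhat;                                             % consistent right-hand side
    S  = @(x) max(abs(x) - lambda, 0) .* sign(x);             % soft shrinkage S_lambda
    fs = @(x) 0.5 * norm(S(x))^2;                             % conjugate f^*
    x = zeros(n, 1); xs = zeros(n, 1);
    for k = 1:20000
        r = A * x - b;
        if norm(r) < 1e-8, break; end                         % simple stopping criterion
        prob = abs(r).^p / sum(abs(r).^p);                    % weighted sampling rule of Sect. 3.1
        ik = find(rand <= cumsum(prob), 1);
        ai = A(ik, :)';
        if exact_step
            tk = fminsearch(@(t) fs(xs - t * ai) + t * b(ik), 0);  % exact step, cf. (12)
        else
            tk = ai' * x - b(ik);                             % inexact step (||a_i||_2 = 1)
        end
        xs = xs - tk * ai;                                    % update of x_k^*
        x  = S(xs);                                           % update of x_k
    end
    fprintf('relative residual = %.2e\n', norm(A * x - b) / norm(b));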

Remark 1

Consider two extreme cases of parameter p in the weighted sampling rule.

  (a) In the case \(p=0\), we have \(p_i=\frac{1}{m},\) which is just uniform sampling, and hence WRaSK reduces to RaSK when the matrix A is normalized by row.

  (b) As \(p\rightarrow \infty \), we have \(i_k=\arg \max _{i=1,\ldots ,m}\{|\langle a_i,x_k\rangle -b_i|\},\) which is the maximal correction method in Feichtinger et al. (1992). In this case, WRaSK is a sparse variant of the partially randomized Kaczmarz (PRK) method in Jiang et al. (2020).

3.3 Convergence analysis of WRaSK methods

The following lemma establishes a relationship between \(D_f^{x^*}(x,{\hat{x}})\) and \(\Vert Ax-b\Vert _2^2\).

Lemma 3

(Schöpfer and Lorenz 2019) Let \({\widetilde{\sigma }}_{\mathrm{min}}(A)\) and \(|{\hat{x}}|_{\mathrm{min}}\) be given in (6), (7) respectively. Then for all \(x\in {\mathbb {R}}^n\) with \(\partial f(x)\cap \mathcal {R}(A^{\mathrm{T}})\ne \emptyset \) and all \(x^*=A^{\mathrm{T}}y\in \partial f(x)\cap \mathcal {R}(A^{\mathrm{T}}),\) we have

$$\begin{aligned} D_f^{x^*}(x,{\hat{x}})\le \frac{1}{{\widetilde{\sigma }}^2_{\mathrm{min}}(A)}\cdot \frac{|{\hat{x}}|_{\mathrm{min}}+2\lambda }{|{\hat{x}}|_{\mathrm{min}}}\cdot \Vert Ax-b\Vert _2^2. \end{aligned}$$

Now, we proceed to show the convergence of WRaSK in noiseless and noisy cases, respectively.

Theorem 1

(Noiseless case) Let \(0< p< \infty \) and suppose that A is normalized so that \(\Vert a_i\Vert _2=1\) for each row. The sequence \(\{x_k\}\) generated by Algorithm 1 converges linearly in expectation to the unique solution \({\hat{x}}\) of the augmented basis pursuit problem (5) in the sense that

$$\begin{aligned} {\mathbb {E}}(D_f^{x_{k+1}^*}(x_{k+1},{\hat{x}}))\le q\cdot {\mathbb {E}}\left( D_f^{x_{k}^*}(x_{k},{\hat{x}})\right) , \end{aligned}$$

and

$$\begin{aligned} {\mathbb {E}}\Vert x_{k}-{\hat{x}}\Vert _2\le q^{\frac{k}{2}}\sqrt{2\lambda \Vert {\hat{x}}\Vert _1+\Vert {\hat{x}}\Vert _2^2}, \end{aligned}$$

where the convergence factor is \(q=1-\frac{1}{2}\cdot {\widetilde{\sigma }}_{\min }^2(A)\cdot \frac{|{\hat{x}}|_{\min }}{|{\hat{x}}|_{\min }+2\lambda }\cdot \mathop {\inf }\limits _{z\ne 0}\frac{\Vert Az\Vert _{l_{p+2}}^{p+2}}{\Vert Az\Vert _{l_{p}}^{p}\Vert Az\Vert _2^{2}}\), and \(z=x-{\hat{x}}\).

Moreover, WRaSK is at least as efficient as RaSK due to

$$\begin{aligned} \mathop {\inf }\limits _{z\ne 0}\frac{\Vert Az\Vert _{l_{p+2}}^{p+2}}{\Vert Az\Vert _{l_{p}}^{p}\Vert Az\Vert _2^{2}}\ge \frac{1}{m}, \end{aligned}$$

where the lower bound \(\frac{1}{m}\) is attained if and only if all entries of Az have the same absolute value.

Since the proof of Theorem 1 is involved, it will be given in Appendix A.

Remark 2

We make the following remarks:

  (a) WRaSK is at least as efficient as RaSK, independent of the value of p.

  (b) The convergence factor \(q=1-\frac{1}{2}\cdot {\widetilde{\sigma }}_{\min }^2(A)\cdot \frac{|{\hat{x}}|_{\min }}{|{\hat{x}}|_{\min }+2\lambda }\cdot \mathop {\inf }\limits _{z\ne 0}\frac{\Vert Az\Vert _{l_{p+2}}^{p+2}}{\Vert Az\Vert _{l_{p}}^{p}\Vert Az\Vert _2^{2}}\) depends on \({\hat{x}}\), \(\lambda \), \({\widetilde{\sigma }}_{\min }(A)\) and p. If A has full column rank, then \({\widetilde{\sigma }}_{\min }(A)=\sigma _{\min }(A)\). If \(\lambda =0\), then the dependence on \({\hat{x}}\) disappears.

  (c) When the residuals \(|a_i^{\mathrm{T}}x_k-b_i|, i=1,\ldots ,m,\) are all equal, the sampling rules of WRaSK and RaSK coincide, in which case the convergence rate of WRaSK is the same as that of RaSK.

In the following, we take the noisy case into account.

Theorem 2

(Noisy case) Assume that noisy observed data \(b^{\delta }\in {\mathbb {R}}^m\) with \(\Vert b^{\delta }-b\Vert _{l_{p+2}}\le \delta \) are given, where \(b=A{\hat{x}}\). Let the sequence \(\{x_k\}\) be generated by WRaSK or EWRaSK with b replaced by \(b^\delta \). Then, with the same convergence factor q as in the noiseless case (Theorem 1), we have (a) for the WRaSK method:

$$\begin{aligned} \begin{aligned} {\mathbb {E}}[\Vert x_k-{\hat{x}}\Vert _2]&\le q^{\frac{k}{2}}\sqrt{2\lambda \Vert {\hat{x}}\Vert _1+\Vert {\hat{x}}\Vert _2^2}+\delta \sqrt{\frac{c^pq}{1-q}}. \end{aligned} \end{aligned}$$

(b) For the EWRaSK method:

$$\begin{aligned} \begin{aligned} {\mathbb {E}}[\Vert x_k-{\hat{x}}\Vert _2]&\le q^{\frac{k}{2}}\sqrt{2\lambda \Vert {\hat{x}}\Vert _1+\Vert {\hat{x}}\Vert _2^2}+\delta \sqrt{\left( 1+\frac{4\lambda \Vert A\Vert _{1,p+2}}{\delta }\right) \cdot \frac{c^pq}{1-q}}, \end{aligned} \end{aligned}$$

where the constant \(c\in {\mathbb {R}}\) characterizes the equivalence of the vector norms \(\Vert \cdot \Vert _{l_p}\) and \(\Vert \cdot \Vert _{l_{p+2}}\).

The proof of Theorem 2 will be given in detail in Appendix B.

4 Some remarks about WRaSK

In this section, we make some noteworthy remarks about WRaSK. First, since WRaSK is at least as efficient as RaSK in terms of convergence rate, we compare their time complexity. Second, we explore the effect of the parameter p on WRaSK. Finally, since WRaSK needs to compute all residuals in each iteration, a partially weighted randomized sparse Kaczmarz method is proposed to address this disadvantage.

4.1 Time complexity of RaSK and WRaSK

The main difference between WRaSK and RaSK lies in the sampling rule. In this part, we compare them in terms of time complexity. The main cost per iteration comes from sampling an index and updating the iterate \(x_{k}\). Note that the exact step, obtained by computing the optimal solution of (12), requires an \(O(n\log (n))\) sorting procedure. According to Table 1, the time complexities of WRaSK and RaSK are at the same level. Moreover, the time complexity of WRaSK is at the same level as that of EWRaSK when \(m>\log (n)\); otherwise, EWRaSK costs more time than WRaSK in each iteration.

Table 1 Comparison of Kaczmarz-type methods in terms of time complexity

4.2 The effect of parameter p on the convergence rate

Note that the convergence rate of WRaSK is determined by

$$\begin{aligned} \frac{\Vert Az\Vert _{l_{p+2}}^{p+2}}{\Vert Az\Vert _{l_{p}}^{p}\Vert Az\Vert _2^{2}}, \end{aligned}$$

whose monotonicity in p was observed in Steinerberger (2021). Here, we provide a detailed proof, viewing this quantity as a function of the parameter p. Denote \(Az=g\). Since multiplying g by a positive constant multiplies the numerator and the denominator of the fraction by the same positive factor, the fraction is unchanged, so we may assume without loss of generality that each non-zero coordinate satisfies \(|g_i|\ge 1\). Ignoring the factor \(\Vert g\Vert _2^2\), which does not depend on p, and denoting \(\log |g_i|=d_i\ge 0\), we can reformulate

$$\begin{aligned} \frac{\Vert Az\Vert _{l_{p+2}}^{p+2}}{\Vert Az\Vert _{l_{p}}^{p}}= \frac{\sum _{i=1}^{n}e^{d_i(p+2)}}{\sum _{i=1}^{n}e^{d_ip}}. \end{aligned}$$

Now, we prove that the function increases as p increases.

Lemma 4

If \(f(x)=\frac{\sum _{i=1}^{n}e^{d_i(x+2)}}{\sum _{i=1}^{n}e^{d_ix}}\), where \(d_i\ge 0\) and \(0<x<\infty ,\) then f is monotonically increasing.

The proof is given in Appendix C. Lemma 4 shows that the convergence rate of WRaSK improves as p increases. Moreover, the parameter p of the weighted sampling rule interpolates between the randomized sampling rule (\(p=0\)) and the greedy sampling rule (\(p\rightarrow \infty \)). Since there is no need to pursue the optimal parameter p, we give an empirical value of p based on the numerical experiments in Sect. 5; a simple numerical illustration of the monotonicity is sketched below.
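The following MATLAB sketch evaluates the ratio for one fixed random direction on a grid of values of p (arbitrary dimensions); by the scale invariance of the ratio, the normalization \(|g_i|\ge 1\) is not needed for this check.

    m = 30; n = 12;
    A = randn(m, n); A = A ./ sqrt(sum(A.^2, 2));
    g = A * randn(n, 1);                           % g = A z for one fixed direction z
    ps = 0:0.5:10;                                 % grid of values of p
    vals = arrayfun(@(p) sum(abs(g).^(p + 2)) / (sum(abs(g).^p) * sum(g.^2)), ps);
    disp(all(diff(vals) >= -1e-12));               % expected output: 1 (nondecreasing in p)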

4.3 The partially weighted randomized sparse Kaczmarz method

In our proposed WRaSK, we have to compute all m residuals of the linear system in each iteration to construct the sampling rule. This is unfavorable for large-scale linear systems, since the appeal of Kaczmarz-type methods lies precisely in their small storage and per-iteration cost. Hence, we should consider how to reduce the number of residuals required in each iteration.

Guided by the idea of selecting rows with larger residuals, a randomized Kaczmarz method with a partially weighted selection step was proposed in Groß (2021). It repeatedly compares the residuals of sampled indices and keeps the larger one. To overcome this weakness of WRaSK, we combine the randomized sparse Kaczmarz method with the partially weighted selection step and propose the partially weighted randomized sparse Kaczmarz (PWRaSK) method. The pseudocode is as follows.

Denote the absolute residual of the i-th index in the k-th iteration by \(r_k(i),1\le i\le m\). The main idea of steps 5–8 of PWRaSK is to find a relatively large residual while computing only a few residuals per iteration: indices are sampled one after another, and the newly drawn residual is compared with that of the current candidate. If the new residual is smaller, the candidate index is used for the update; otherwise, the new index becomes the candidate and the sampling continues. Similar to the convergence analysis of the algorithm proposed in Groß (2021), the convergence rate of PWRaSK can be derived from that of the randomized sparse Kaczmarz method (RaSK) (Lorenz et al. 2014b; Patel et al. 2021). Moreover, PWRaSK is expected to converge faster than RaSK, as its sampling rule approaches the maximal correction method (Feichtinger et al. 1992) or the partially randomized Kaczmarz method (Jiang et al. 2020).

Algorithm 2 The partially weighted randomized sparse Kaczmarz (PWRaSK) method; pseudocode not reproduced here
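Since the pseudocode of Algorithm 2 is not reproduced here, the selection step can be sketched from the verbal description above as follows (placeholder data; the actual steps 5–8 of Algorithm 2 may differ in details).

    m = 40; n = 15;
    A = randn(m, n); A = A ./ sqrt(sum(A.^2, 2));
    xhat = zeros(n, 1); xhat(1:3) = randn(3, 1); b = A * xhat;
    xk = zeros(n, 1);                               % current iterate x_k
    res = @(j) abs(A(j, :) * xk - b(j));            % residual of a single sampled row
    cand = randi(m); rc = res(cand);                % initial candidate index and its residual
    while true
        j = randi(m); rj = res(j);                  % draw a new index uniformly at random
        if rj < rc
            ik = cand;                              % use the candidate, which has the larger residual
            break;
        end
        cand = j; rc = rj;                          % otherwise the new index becomes the candidate
    end
    % the selected index ik is then used in the RaSK-type update (4)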

5 Numerical experiments

In this section, we discuss the effectiveness of WRaSK through experiments. In Sect. 5.1, we analyze the effect of the parameter p on WRaSK. In Sect. 5.2, we solve linear systems whose coefficient matrix \(A\in {\mathbb {R}}^{m\times n}\) is generated by the MATLAB function ‘randn’ or chosen from the SuiteSparse Matrix Collection (Davis and Hu 2011).

Without loss of generality, we normalize A by row. Moreover, \({\hat{x}}\in {\mathbb {R}}^n\) is created with the MATLAB function ‘sparserandn’, which randomly chooses the locations of the nonzero entries according to the prescribed sparsity. The exact data \(b\in {\mathbb {R}}^m\) are computed via \(b=A{\hat{x}}\) in the noiseless case, while in the noisy case we use ‘randn’ to generate noise r and set \(b^{\delta } =b+r\). Our experiments are initialized with \(x_0=x_0^*=0\), and we carry out 60 trials. The main outputs are the relative residual and the relative error, defined, respectively, by

$$\begin{aligned} \mathrm{Residual} =\frac{\Vert Ax-b\Vert _2}{\Vert b\Vert _2}, \quad \mathrm{Error}=\frac{\Vert x-{\hat{x}}\Vert _2}{\Vert {\hat{x}}\Vert _2}. \end{aligned}$$

All experiments are performed with MATLAB (version R2021b) on a personal computer with 2.80-GHZ CPU(Intel(R) Core(TM) i7-1165G7), 16-GB memory, and Windows operating system (Windows 10).
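For concreteness, the test-problem generation and the two reported quantities can be sketched as follows; the routine ‘sparserandn’ is emulated here with standard MATLAB calls, and the noise level shown is only illustrative (this emulation is an assumption, not the code actually used in our experiments).

    m = 400; n = 200; sparsity = 25;
    A = randn(m, n); A = A ./ sqrt(sum(A.^2, 2));   % row-normalized coefficient matrix
    xhat = zeros(n, 1);
    idx  = randperm(n, sparsity);
    xhat(idx) = randn(sparsity, 1);                 % sparse ground truth
    b = A * xhat;                                   % exact data
    bdelta = b + 0.01 * randn(m, 1);                % noisy data (noise level illustrative)
    x = zeros(n, 1);                                % iterate returned by a solver
    Residual = norm(A * x - b) / norm(b);
    Error    = norm(x - xhat) / norm(xhat);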

5.1 Parameter tuning

In this test, numerical experiments are carried out to verify Lemma 4. Set \(n=200\), \(m=400\), \(\hbox {sparsity} =25\), and select p from \(1,\frac{m}{80},\frac{m}{40},\infty \) in WRaSK. The matrix \(A\in {\mathbb {R}}^{m\times n}\) is generated by the MATLAB function ‘randn’. To explore the performance of WRaSK for different values of p, we run 60 trials and compute the medians of the relative residuals and errors for RaSK (\(p=0\)), WRaSK with different parameters p, and the maximal correction method (\(p\rightarrow \infty \)).

According to Fig. 1, we observe that WRaSK performs better as p increases and eventually approximates the maximal correction method, which confirms Lemma 4. Since we do not seek the optimal value of the parameter p, and \(p=\frac{m}{40}\) is good enough according to this experiment, we take \(p=\frac{m}{40}\) as the empirical value for WRaSK in the following experiments, even though it is not necessarily the best value.

Fig. 1

Comparison of the residuals and errors of WRaSK with different parameters p, i.e., RaSK (black), \(p=1\) (blue), \(p=m/80\) (red), \(p=m/40\) (green), and \(p\rightarrow \infty \) (orange); \(n=200,m=400,\hbox {sparsity}=25,\lambda =1\). Left: relative residual \(\Vert Ax-b\Vert _2/\Vert b\Vert _2\); right: relative error \(\frac{\Vert x- \hat{x}\Vert _2}{\Vert \hat{x}\Vert _2}\). The lines show the median over 60 trials (color figure online)

5.2 Comparisons of different Kaczmarz variants

Here, we explore the performance of WRaSK in terms of the number of iteration steps (IT) and the computing time in seconds (CPU). We perform 60 runs and record IT and CPU when the relative error reaches the given accuracy or the maximum number of iteration steps is attained. Then we take the medians of CPU and IT over the 60 records.

5.2.1 Simulated data

To compare RaSK, WRK and WRaSK, we construct the coefficient matrix A with the MATLAB function ‘randn’. Set \(p=\frac{m}{40}, \hbox {sparsity}=20,\lambda =1\). The iteration terminates once the relative error satisfies \(\hbox {Error}<10^{-3}\) in the noiseless case and \(\hbox {Error}<10^{-1}\) in the noisy case, or the maximum number of iteration steps, 200,000, is reached. If the maximum number of iteration steps is reached, we denote this by “–”. The experimental results are recorded in Table 2.

Table 2 IT and CPU of Kaczmarz-type methods for matrices A generated from ‘randn’

According to Table 2, we find that, whether there is noise or not, EWRaSK has the best performance in the overdetermined cases, requiring the least CPU time and the fewest iteration steps to satisfy the stopping rule, while RaSK outperforms the other methods in the underdetermined cases. Note that WRK has the worst performance: in the noisy or underdetermined cases it fails to converge.

5.2.2 Real data

In this subsection, the performance of WRaSK, PWRaSK and WSSKM is examined on some real matrices. The coefficient matrices A come from the SuiteSparse Matrix Collection (Davis and Hu 2011) and originate from different applications such as least squares problems, 2D/3D problems, optimal control problems, and so on.

In this test, we generate the ground truth \({\hat{x}}\) with \(\hbox {sparsity}=20\) and compute \(A{\hat{x}}\) as the observed data b; the details of the chosen matrices are presented in Table 3. Set \(p=\frac{m}{40}\). We run ERaSK, WRK and EWRaSK on these problems and terminate the algorithms once the relative error reaches \(10^{-3}\) or the maximum number of iterations, 200,000, is attained. The IT and CPU medians over 60 trials in the different settings are listed in Table 3.

Table 3 IT and CPU of Kaczmarz-type methods for matrices A from real data

According to Table 3, we observe that EWRaSK performs well overall in terms of IT and CPU. Although the sampling rule of WRaSK costs more time per iteration, the overall time of WRaSK is less than that of the other methods.

6 Conclusion

In this paper, we propose the WRaSK method for finding sparse solutions of consistent linear systems, together with a detailed convergence analysis in both the noiseless and noisy cases. WRaSK, as a variant of RaSK, combines the advantages of RaSK and the weighted sampling rule. The theoretical results show that WRaSK is at least as efficient as RaSK. To reduce the per-iteration cost of WRaSK, we provide a possible remedy, PWRaSK, which reduces the number of residuals computed in each iteration. Numerical experiments also demonstrate the superiority of WRaSK.

As future work, we wonder whether it is possible to extend our study to other recently proposed sparse Kaczmarz methods.