1 Introduction

This paper focuses on numerical algorithms for the linear complementarity problem (LCP), which is defined as follows: given a matrix \(M\in {\mathbb {R}}^{n\times n}\) and a vector \(q\in {\mathbb {R}}^{n}\), find \(z\in {\mathbb {R}}^{n}\) satisfying the constraints

$$\begin{aligned} r=Mz+q\ge 0,z\ge 0,~\text {and}~z^{T}r=0, \end{aligned}$$

where for two \(s\times t\) matrices \(K=(k_{ij})\) and \(T=(t_{ij})\) the order \(K\ge (>)T\) means \(k_{ij}\ge (>)t_{ij}\) for all i and j. The LCP has a wide range of applications in fields such as contact problems, network equilibrium problems, and free boundary problems; see Cottle and Pang (1992), Murty (1988) and the references therein.
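As a quick illustration (ours, not taken from the cited references), note that the constraints above hold exactly when \(\min \{z,Mz+q\}=0\) componentwise, which also underlies the stopping criterion used later. A minimal NumPy sketch measuring how far a candidate vector is from solving the LCP:

    import numpy as np

    def lcp_residual(M, q, z):
        # r = Mz + q; the LCP requires r >= 0, z >= 0 and z^T r = 0,
        # which holds iff min(z, r) = 0 componentwise.
        r = M @ z + q
        return np.linalg.norm(np.minimum(z, r))

    M = np.array([[4.0, -1.0], [-1.0, 4.0]])
    q = np.array([-1.0, 2.0])
    print(lcp_residual(M, q, np.array([0.25, 0.0])))  # 0.0: an exact solution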

The modulus-based matrix splitting (MMS) iteration method, which is built on an equivalent modulus equation of the LCP through matrix splitting techniques, has become a research focus in recent years. The MMS iteration method was first proposed by Bai (2010), where the LCP is equivalently transformed into the following modulus equation:

$$\begin{aligned} (\Omega +M)x=(\Omega -M)|x|-\gamma q, \end{aligned}$$
(1)

where \(\Omega \) is a positive diagonal matrix, \(\gamma >0\) and the absolute value is taken componentwise. Once the solution \(x\in {\mathbb {R}}^{n}\) of (1) is obtained, the solution of the LCP is computed by \(z=\frac{1}{\gamma }(x+|x|)\). MMS iteration methods are more convenient and efficient than the projected relaxation method (Cryer 1971) and the modified modulus method (Dong and Jiang 2009). Furthermore, researchers have generalized and accelerated the MMS method in various ways, such as Fang (2022), Fang et al. (2023), Fang and Zhu (2019), Huang and Ma (2018), Ke and Ma (2014), Ke et al. (2018), Ren et al. (2019), Song et al. (2022), Wu et al. (2018), Zhang (2011), Zheng et al. (2021), Zheng et al. (2017), Zheng et al. (xxxx), Zheng and Vong (2021), Zheng et al. (2019). On the other hand, with the aid of high-performance computers, parallel methods have been studied by introducing synchronous multisplitting techniques into (1). The modulus-based synchronous multisplitting (MSM) iteration method was first introduced in Bai and Zhang (2013b). The idea of the MSM method is to distribute the workload evenly among the workers, each of which solves a simple linear system built from a different matrix splitting at every iteration. By combining different acceleration techniques, the MSM method has also been improved in subsequent research, such as Bai and Zhang (2013a), Wu and Li (2019), Xu et al. (2020), Zhang (2014), Zhang (2015).
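To make (1) concrete, the following minimal serial sketch (an illustration of ours, not the implementation used later in the experiments) iterates the modulus equation with the illustrative choices \(\Omega =D_M\) and \(\gamma =1\); the MMS variants replace the coefficient matrix \(\Omega +M\) by \(\Omega +F\) for a splitting \(M=F-G\) and move \(Gx\) to the right-hand side:

    import numpy as np

    def mms(M, q, gamma=1.0, tol=1e-6, maxit=1000):
        Omega = np.diag(np.diag(M))        # a common choice of the parameter matrix
        A = Omega + M                      # coefficient matrix of the modulus equation (1)
        x = np.ones(len(q))
        for k in range(1, maxit + 1):
            x = np.linalg.solve(A, (Omega - M) @ np.abs(x) - gamma * q)
            z = (x + np.abs(x)) / gamma    # recover the LCP iterate
            if np.linalg.norm(np.minimum(z, M @ z + q)) < tol:
                break
        return z, k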

Among the numerous acceleration techniques for MMS and MSM, the two-step method is an important one. Its idea is to use two different splittings of the system matrix M to construct two linear systems that are solved in each iteration. The advantage of this approach is that it can fully exploit the information in the elements of the system matrix to achieve acceleration. The two-step methods for solving the LCP can be mainly divided into two categories, serial and parallel; see Zhang (2011) and Zhang (2015) for details, respectively. The two-step method also has important applications in the iterative solution of linear equations. In a recent work (Bai xxxx), Bai discussed in detail the two-step MMS paradigm for linear equations, where a TMSI paradigm was introduced; it can be seen as the result of combining the techniques of two-step splitting and relaxation.

In this paper, we focus further on the acceleration of two-step parallel methods for solving the LCP. Inspired by the TMSI paradigm in Bai (xxxx), we introduce a relaxation technique into the two-step MSM (TMSM) method given in Zhang (2015) and construct a relaxed two-step parallel method in Sect. 2. Next, we provide a convergence analysis of the proposed method and obtain the selection range of the relaxation parameters in Sect. 3, where the given convergence results generalize and improve the corresponding ones in Zhang (2015). In Sect. 4, numerical examples show that the proposed method converges faster than the existing methods. Finally, conclusions are presented in the last section.

Next, we provide some notations, definitions, and results that need to be used in subsequent discussions.

We use e to denote the \(n\times 1\) vector whose entries are all equal to 1 and \(\textrm{diag}(x)\) to denote the \(n\times n\) diagonal matrix whose main diagonal is \(x\in {\mathbb {R}}^n\). Let \(M=(m_{ij})\in {\mathbb {R}}^{n\times n}\). The absolute value of M is denoted by \(|M|=(|m_{ij}|)\). The comparison matrix of M is denoted by \(\langle M\rangle =(m_{ij}^{\prime })\), defined as

$$\begin{aligned} m_{ij}^{\prime }=\left\{ \begin{array}{c} |m_{ii}|,\ \ \text {if }i=j, \\ -|m_{ij}|,\ \ \text {if }i\ne j. \end{array} \right. \end{aligned}$$

The spectral radius of M is denoted by \(\rho (M)\). The diagonal matrix formed by the diagonal elements of M is denoted by \(D_M\), the strictly upper triangular part of M is denoted by \(-U_M\), and the strictly lower triangular part by \(-L_M\), so that \(M=D_M-L_M-U_M\). The following are the definitions of some special matrices and of matrix splittings (Bai 1999; Berman and Plemmons 1994; Frommer and Szyld 1992); a short code sketch illustrating these notions follows the list:

  • if \(m_{ij}\le 0\) for \(i\ne j\), M is called a Z-matrix;

  • if M is a nonsingular Z-matrix and \(M^{-1}\ge 0\), M is called a nonsingular M-matrix;

  • if \(|m_{ii}|>\sum \nolimits _{j\ne i}|m_{ij}|\) for any \(i\in \{1,\ldots ,n\}\), M is called a strictly diagonally dominant (s.d.d.) matrix;

  • if \(\langle M\rangle \) is a nonsingular M-matrix, M is called an H-matrix;

  • if M is an H-matrix with positive diagonal entries, M is called an \(H_+\)-matrix;

  • if F is nonsingular, \(M=F-G\) is called a splitting of M;

  • if \(\langle F\rangle -|G|\) is a nonsingular M-matrix, \(M=F-G\) is called an H-splitting of M;

  • if \(\langle F\rangle -|G|=\langle M\rangle \), \(M=F-G\) is called an H-compatible splitting of M;

  • if \(M=F_i-G_i\) is a splitting of M for \(1\le i\le \ell \), \(\{F_i,G_i,E_i\}_{i=1}^\ell \) is called a multisplitting of M, where \(E_i\in {\mathbb {R}}^{n\times n}\) is a nonnegative diagonal weight matrix satisfying \(\sum \nolimits _{i=1}^\ell E_i=I\);

  • if \(M=F_i^{(1)}-G_i^{(1)}=F_i^{(2)}-G_i^{(2)}\) are two splittings of M for \(1\le i\le \ell \), \(\{F_i^{(1)},G_i^{(1)},F_i^{(2)},G_i^{(2)},E_i\}_{i=1}^\ell \) is called a two-step multisplitting of M, where the definition of \(E_i\) is the same as the one given above.
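These notions translate directly into code. A small NumPy sketch (ours, for illustration only) computes the comparison matrix, checks the H-matrix property, and forms the triangular splitting:

    import numpy as np

    def comparison_matrix(M):
        # <M>: |m_ii| on the diagonal, -|m_ij| off the diagonal
        C = -np.abs(M)
        np.fill_diagonal(C, np.abs(np.diag(M)))
        return C

    def is_nonsingular_M_matrix(A, tol=1e-12):
        # a Z-matrix A is a nonsingular M-matrix iff it is nonsingular and A^{-1} >= 0
        return np.all(np.linalg.inv(A) >= -tol)

    M = np.array([[4.0, -1.0], [-2.0, 5.0]])
    D_M = np.diag(np.diag(M))
    L_M = -np.tril(M, -1)                     # strictly lower part, sign flipped
    U_M = -np.triu(M, 1)                      # strictly upper part, sign flipped
    assert np.allclose(M, D_M - L_M - U_M)
    print(is_nonsingular_M_matrix(comparison_matrix(M)))  # True: M is an H-matrix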

2 New method

We first review the TMSM method given in Zhang (2015). Let \(M=F^{(1)}-G^{(1)}=F^{(2)}-G^{(2)}\) be two splittings of M. Then, the LCP can be equivalently transformed into the system of fixed-point equations

$$\begin{aligned} \left\{ \begin{array}{l} (\Omega +F^{(1)})x=G^{(1)}x+(\Omega -M)|x|-\gamma q, \\ (\Omega +F^{(2)})x=G^{(2)}x+(\Omega -M)|x|-\gamma q, \end{array} \right. \end{aligned}$$
(2)

where \(\Omega \) is a positive diagonal parameter matrix and \(\gamma \) is a positive constant; see Zhang (2011) for more details. Let \(\{F_i^{(1)},G_i^{(1)},F_i^{(2)},G_i^{(2)},E_i\}_{i=1}^\ell \) be a two-step multisplitting of M. Based on (2), in order to reduce the communication cost between processors in a high-performance computing environment with multiple processors and to make full use of the information from the previous iteration, the TMSM iteration method was proposed in Zhang (2015) as follows:

Method 1

(Zhang 2015) Given \(x^{(0)}\in {\mathbb {R}}^{n}\) and \(\epsilon >0\), for \(i=1,2,\ldots ,\ell \), compute \(x^{(k+1,i)}\in {\mathbb {R}}^{n}\) by

$$\begin{aligned} \left\{ \begin{array}{l} (\Omega +F_i^{(1)})x^{(k+\frac{1}{2},i)}=G_i^{(1)}x^{(k)}+(\Omega -M)|x^{(k)}|-\gamma q, \\ (\Omega +F_i^{(2)})x^{(k+1,i)}=G_i^{(2)}x^{(k+\frac{1}{2},i)}+(\Omega -M)|x^{(k+\frac{1}{2},i)}|-\gamma q. \end{array} \right. \end{aligned}$$

Then, calculate the weighted combination of the updates from the \(\ell \) processors

$$\begin{aligned} x^{(k+1)}=\sum \limits _{i=1}^{\ell }{E_ix^{(k+1,i)}} ~~\text {and}~~ z^{(k+1)}=\frac{1}{\gamma }(|x^{(k+1)}|+x^{(k+1)}) \end{aligned}$$

for \(k=0,1,2,\ldots \) until \(\Vert \min \{z^{(k+1)},Mz^{(k+1)}+q\}\Vert <\epsilon \).

To obtain faster convergence, in the ith worker, we consider mixing the local two-step update with the previous iterate \(x^{(k)}\) before the weighted combination. By introducing relaxation parameters, we propose the relaxed TMSM method as follows:

Method 2

Given an initial vector \(x^{(0)}\in {\mathbb {R}}^{n}\) and \(\epsilon >0\), for \(i=1,2,\ldots ,\ell \), compute \(x^{(k+1,i)}\in {\mathbb {R}}^{n}\) by

$$\begin{aligned} \left\{ \begin{array}{rll} (\Omega +F_i^{(1)})x^{(k+\frac{1}{3},i)}&{}=&{}G_i^{(1)}x^{(k)}+(\Omega -M)|x^{(k)}|-\gamma q, \\ (\Omega +F_i^{(2)})x^{(k+\frac{2}{3},i)}&{}=&{}G_i^{(2)}x^{(k+\frac{1}{3},i)}+(\Omega -M)|x^{(k+\frac{1}{3},i)}|-\gamma q,\\ x^{(k+1,i)}&{}=&{}\lambda ^{(k,i)} x^{(k+\frac{2}{3},i)}+(1-\lambda ^{(k,i)})x^{(k)}, \end{array} \right. \end{aligned}$$
(3)

where \(\lambda ^{(k,i)}\) is a scalar relaxation parameter. Then, calculate the weighted combination of the updates from the \(\ell \) processors

$$\begin{aligned} x^{(k+1)}=\sum \limits _{i=1}^{\ell }{E_ix^{(k+1,i)}} ~~\text {and}~~ z^{(k+1)}=\frac{1}{\gamma }(|x^{(k+1)}|+x^{(k+1)}) \end{aligned}$$

for \(k=0,1,2,\ldots \) until \(\Vert \min \{z^{(k+1)},Mz^{(k+1)}+q\}\Vert <\epsilon \).
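The following serial sketch (ours) mimics Method 2, with the parallel loop over the \(\ell \) workers simulated sequentially; the constant relaxation parameter lam and \(\gamma =1\) are illustrative assumptions. Setting lam to 1.0 recovers Method 1:

    import numpy as np

    def relaxed_tmsm(M, q, splittings, weights, Omega, lam=1.2, gamma=1.0,
                     tol=1e-6, maxit=1000):
        # splittings: list of (F1, G1, F2, G2) with M = F1 - G1 = F2 - G2
        # weights: list of nonnegative vectors (diagonals of E_i) summing to 1
        x = np.ones_like(q, dtype=float)
        for k in range(1, maxit + 1):
            x_next = np.zeros_like(x)
            for (F1, G1, F2, G2), w in zip(splittings, weights):
                y = np.linalg.solve(Omega + F1,
                                    G1 @ x + (Omega - M) @ np.abs(x) - gamma * q)
                v = np.linalg.solve(Omega + F2,
                                    G2 @ y + (Omega - M) @ np.abs(y) - gamma * q)
                x_next += w * (lam * v + (1.0 - lam) * x)  # relaxation step in (3)
            x = x_next
            z = (np.abs(x) + x) / gamma
            if np.linalg.norm(np.minimum(z, M @ z + q)) < tol:
                break
        return z, k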

For Method 2, we make the following comments.

  • If we take \(\lambda ^{(k,i)}\equiv 1\), then Method 2 reduces to Method 1, which implies that the scope of the TMSM method is extended. Furthermore, one may choose suitable relaxation parameters \(\lambda ^{(k,i)}\) to make Method 2 converge faster than Method 1.

  • In applications, one can choose the splittings as those in Zhang (2015) to obtain the classes of relaxed symmetric MSM accelerated overrelaxation (RSMSMAOR), relaxed symmetric MSM successive overrelaxation (RSMSMSOR) and relaxed symmetric MSM Gauss-Seidel (RSMSMGS) iteration methods.

3 Convergence analysis

Some useful lemmas are presented first.

Lemma 3

(Frommer and Mayer 1989) If M is an H-matrix, then \(|M^{-1}|\le \langle {M}\rangle ^{-1}\).

Lemma 4

(Hu 1982) Let \(F\in {{\mathbb {R}}^{n\times n}}\) be an s.d.d. matrix. Then, \(\forall G\in {{\mathbb {R}}^{n\times n}}\),

$$\begin{aligned} \Vert {F^{-1}G}\Vert _\infty \le {\max \limits _{1\le j\le n}\frac{(|G|e)_j}{(\langle {F}\rangle e)_j}}. \end{aligned}$$
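Lemma 4 can be checked numerically on concrete instances. A small sketch (ours, with an arbitrarily chosen s.d.d. matrix F):

    import numpy as np

    F = np.array([[5.0, -1.0], [-2.0, 6.0]])   # strictly diagonally dominant
    G = np.array([[1.0, 0.5], [0.0, 1.0]])
    lhs = np.linalg.norm(np.linalg.solve(F, G), np.inf)
    comp_F = -np.abs(F)
    np.fill_diagonal(comp_F, np.abs(np.diag(F)))           # comparison matrix <F>
    rhs = np.max((np.abs(G) @ np.ones(2)) / (comp_F @ np.ones(2)))
    print(lhs <= rhs)   # True: lhs is about 0.357, rhs equals 0.375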

Lemma 5

(Berman and Plemmons 1994) Let M be a Z-matrix. Then, the following statements are equivalent:

  • M is a nonsingular M-matrix;

  • There exists a positive diagonal matrix \(\Pi \), such that \(M\Pi \) is an s.d.d. matrix with positive diagonal entries;

  • For any splitting \(M=F-G\) satisfying \(F^{-1}\ge 0\) and \(G\ge 0\), it holds that \(\rho (F^{-1}G)<1\).

Next, the main convergence theorem for Method 2 is given.

Theorem 6

For \(i=1,2,\ldots ,\ell \), assume that:

\(\mathbf{(I)}\):

\(M=F_i^{(1)}-G_i^{(1)}=F_i^{(2)}-G_i^{(2)}\) are H-splittings of M;

\(\mathbf{(II)}\):

there exists a positive diagonal matrix \(\Pi \) such that \(\langle M\rangle \Pi \), \((\langle F_i^{(1)}\rangle -|G_i^{(1)}|)\Pi \) and \((\langle F_i^{(2)}\rangle -|G_i^{(2)}|)\Pi \) are s.d.d. matrices.

Then, Method 2 is convergent for any initial vector \(x^{(0)}\in {\mathbb {R}}^{n}\) provided

$$\begin{aligned} \Omega e>D_Me-\frac{1}{2}\min \limits _{\begin{array}{c} \scriptstyle 1\le i\le \ell \\ \scriptstyle s=1,2 \end{array}}\big {[}\Pi ^{-1}(\langle M\rangle +\langle F_i^{(s)}\rangle -|G_i^{(s)}|)\Pi e\big {]}, \end{aligned}$$
(4)

and

$$\begin{aligned} 0<\lambda ^{(k,i)}<\frac{2}{1+\Vert \Pi ^{-1}{\mathcal {P}}_i^{(2)}{\mathcal {P}}_i^{(1)}\Pi \Vert _{\infty }}, \end{aligned}$$
(5)

where

$$\begin{aligned} {\mathcal {P}}_i^{(s)}=(\Omega +\langle F_i^{(s)}\rangle )^{-1}(|G_i^{(s)}|+|\Omega -M|),\quad s=1,2. \end{aligned}$$
(6)

Proof

Let \(z^{*}\) be the solution of the LCP. By Zhang (2015), \(x^{*} = \frac{\gamma }{2} (z^{*} - \Omega ^{-1}r^{*})\) with \(r^{*}=Mz^{*}+q\) satisfies the implicit fixed-point equations

$$\begin{aligned} \left\{ \begin{array}{l} (\Omega +F_i^{(1)})x^{*}=G_i^{(1)}x^{*}+(\Omega -M)|x^{*}|-\gamma q, \\ (\Omega +F_i^{(2)})x^{*}=G_i^{(2)}x^{*}+(\Omega -M)|x^{*}|-\gamma q. \end{array} \right. \end{aligned}$$
(7)

Denote the errors in the kth step by

$$\begin{aligned} \delta ^{(k)}=x^{(k)}-x^{*}, \delta ^{(k,i)}=x^{(k,i)}-x^{*}. \end{aligned}$$

For \(s=1,2\), since \((\langle F_i^{(s)}\rangle -|G_i^{(s)}|)\Pi \) are s.d.d. matrices, we have

$$\begin{aligned} (\Omega +\langle F_i^{(s)}\rangle )\Pi e\ge \langle F_i^{(s)}\rangle \Pi e\ge (\langle F_i^{(s)}\rangle -|G_i^{(s)}|)\Pi e>0, \end{aligned}$$

which implies that \((\Omega +\langle F_i^{(s)}\rangle )\Pi \) are s.d.d. matrices and \(\Omega +F_i^{(s)}\) are H-matrices. By subtracting (7) from the first and second equations of (3), with Lemma 3 we have

$$\begin{aligned} |\delta ^{(k+\frac{1}{3},i)}|\le |(\Omega +F_i^{(1)})^{-1}|(|G_i^{(1)}|+|\Omega -M|)|\delta ^{(k)}|\le {\mathcal {P}}_i^{(1)}|\delta ^{(k)}|. \end{aligned}$$

Similarly, we can get \(|\delta ^{(k+\frac{2}{3},i)}|\le {\mathcal {P}}_i^{(2)}|\delta ^{(k+\frac{1}{3},i)}|.\) Then, with the third equation of (3), we have

$$\begin{aligned} |\delta ^{(k+1,i)}|\le |\lambda ^{(k,i)}||\delta ^{(k+\frac{2}{3},i)}|+|1-\lambda ^{(k,i)}||\delta ^{(k)}|\le (\lambda ^{(k,i)}{\mathcal {P}}_i^{(2)}{\mathcal {P}}_i^{(1)}+|1-\lambda ^{(k,i)}|I)|\delta ^{(k)}|. \end{aligned}$$

Note that \(x^{(k+1)}=\sum \nolimits _{i=1}^{\ell }{E_ix^{(k+1,i)}}\). We obtain

$$\begin{aligned} |\delta ^{(k+1)}|\le \sum \limits _{i=1}^{\ell }E_i(\lambda ^{(k,i)}{\mathcal {P}}_i^{(2)}{\mathcal {P}}_i^{(1)}+|1-\lambda ^{(k,i)}|I)|\delta ^{(k)}|. \end{aligned}$$
(8)

Now we estimate the infinity norm of \(\Pi ^{-1}{\mathcal {P}}_i^{(s)}\Pi \). By Lemma 4, we have

$$\begin{aligned} \Vert \Pi ^{-1}{\mathcal {P}}_i^{(s)}\Pi \Vert _\infty \le \max \limits _{1\le j\le n}\frac{[(|G_i^{(s)}|+|\Omega -M|)\Pi e]_j}{[(\Omega +\langle F_i^{(s)}\rangle )\Pi e]_j}. \end{aligned}$$
(9)

Since \(\langle M\rangle \Pi \) and \((\langle F_i^{(s)}\rangle -|G_i^{(s)}|)\Pi \) are s.d.d. matrices, it holds that \(\Pi ^{-1}(\langle M\rangle +\langle F_i^{(s)}\rangle -|G_i^{(s)}|)\Pi e>0\).

Case 1: If \(\Omega \ge D_M\), we have

$$\begin{aligned}{} & {} (\Omega +\langle F_i^{(s)}\rangle )\Pi e-(|G_i^{(s)}|+|\Omega -M|)\Pi e\\{} & {} \quad =(\Omega +\langle F_i^{(s)}\rangle -|G_i^{(s)}|-|\Omega -M|)\Pi e\\{} & {} \quad \ge (\langle M\rangle +\langle F_i^{(s)}\rangle -|G_i^{(s)}|)\Pi e\\{} & {} \quad >0. \end{aligned}$$

Case 2: If \(D_Me>\Omega e>D_Me-\frac{1}{2}\min \nolimits _{\begin{array}{c} \scriptstyle 1\le i\le \ell \\ \scriptstyle s=1,2 \end{array}}\big {[}\Pi ^{-1}(\langle M\rangle +\langle F_i^{(s)}\rangle -|G_i^{(s)}|)\Pi e\big {]}\), we have

$$\begin{aligned} D_{M}\Pi e>\Omega \Pi e>\big {[}\frac{1}{2}(|M|-\langle F_i^{(s)}\rangle +|G_i^{(s)}|)\big {]}\Pi e \end{aligned}$$

and

$$\begin{aligned}{} & {} (\Omega +\langle F_i^{(s)}\rangle )\Pi e-(|G_i^{(s)}|+|\Omega -M|)\Pi e\\{} & {} \quad =(\Omega +\langle F_i^{(s)}\rangle -|G_i^{(s)}|-|\Omega -M|)\Pi e\\{} & {} \quad \ge (2\Omega -|M|+\langle F_i^{(s)}\rangle -|G_i^{(s)}|)\Pi e\\{} & {} \quad >0. \end{aligned}$$

Then by (9), we have \(\Vert \Pi ^{-1}{\mathcal {P}}_i^{(s)}\Pi \Vert _\infty <1\), which implies that

$$\begin{aligned} \Vert \Pi ^{-1}{\mathcal {P}}_i^{(2)}{\mathcal {P}}_i^{(1)}\Pi \Vert _\infty \le \Vert \Pi ^{-1}{\mathcal {P}}_i^{(2)}\Pi \Vert _\infty \Vert \Pi ^{-1}{\mathcal {P}}_i^{(1)}\Pi \Vert _\infty <1. \end{aligned}$$

By direct computation, we have

$$\begin{aligned} \lambda ^{(k,i)}\Vert {\Pi }^{-1}{\mathcal {P}}_i^{(2)}{\mathcal {P}}_i^{(1)}{\Pi }\Vert _\infty +|1-\lambda ^{(k,i)}|<1 \end{aligned}$$

for \(\lambda ^{(k,i)}\) satisfying (5). Then, we have the following inequality:

$$\begin{aligned}{} & {} \rho \big {(}\sum \limits _{i=1}^\ell E_i(\lambda ^{(k,i)} {\mathcal {P}}_i^{(2)}{\mathcal {P}}_i^{(1)}+|1-\lambda ^{(k,i)}|I)\big {)}\\{} & {} \quad =\rho \big {(}{\Pi }^{-1}\sum \limits _{i=1}^\ell E_i(\lambda ^{(k,i)} {\mathcal {P}}_i^{(2)}{\mathcal {P}}_i^{(1)}+|1-\lambda ^{(k,i)}|I)\Pi \big {)}\\{} & {} \quad =\rho \big {(}\sum \limits _{i=1}^\ell E_i(\lambda ^{(k,i)} \Pi ^{-1}{\mathcal {P}}_i^{(2)}{\mathcal {P}}_i^{(1)}\Pi +|1-\lambda ^{(k,i)}|I)\big {)}\\{} & {} \quad \le \max \limits _{1\le l\le n}\sum \limits _{i=1}^{\ell }e_l^{(i)}\Vert \lambda ^{(k,i)} \Pi ^{-1}{\mathcal {P}}_i^{(2)}{\mathcal {P}}_i^{(1)}\Pi +|1-\lambda ^{(k,i)}|I\Vert _\infty \\{} & {} \quad \le \max \limits _{1\le l\le n}\sum \limits _{i=1}^{\ell }e_l^{(i)}(\lambda ^{(k,i)}\Vert \Pi ^{-1}{\mathcal {P}}_i^{(2)}{\mathcal {P}}_i^{(1)}\Pi \Vert _\infty +|1-\lambda ^{(k,i)}|)\\{} & {} \quad <1, \end{aligned}$$

where \(e_l^{(i)}\) is the lth diagonal entry of \(E_i\). Hence, by (8), the error \(|\delta ^{(k)}|\) tends to zero, so the sequence \(\{x^{(k)}\}_{k=1}^{\infty }\) converges to \(x^{*}\), proving the claim. \(\square \)
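In practice, the right end of the admissible interval (5) can be evaluated numerically once a suitable \(\Pi \) is available (e.g. \(\Pi =I\)). The following sketch (ours) assembles \({\mathcal {P}}_i^{(s)}\) from (6) and returns the bound for one worker:

    import numpy as np

    def comparison_matrix(A):
        C = -np.abs(A)
        np.fill_diagonal(C, np.abs(np.diag(A)))
        return C

    def lambda_upper_bound(M, F1, G1, F2, G2, Omega, Pi):
        # P^{(s)} = (Omega + <F^{(s)}>)^{-1} (|G^{(s)}| + |Omega - M|), cf. (6)
        P = lambda F, G: np.linalg.solve(Omega + comparison_matrix(F),
                                         np.abs(G) + np.abs(Omega - M))
        B = np.linalg.solve(Pi, P(F2, G2) @ P(F1, G1) @ Pi)
        return 2.0 / (1.0 + np.linalg.norm(B, np.inf))   # right end of (5)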

Remark 1

In the convergence theorems of Zhang (2015), all splittings of M are assumed to be H-compatible splittings, and the positive diagonal matrix \(\Omega \) should satisfy

$$\begin{aligned} \Omega \ge D_M. \end{aligned}$$
(10)

By Theorem 6, we have the next comments:

  • It is well known that an H-compatible splitting is an H-splitting but not vice versa, which implies that the assumption on the matrix splittings in Theorem 6 is weaker than that of Zhang (2015).

  • Note that by the proof of Theorem 6, we have \(\Pi ^{-1}(\langle M\rangle +\langle F_i^{(s)}\rangle -|G_i^{(s)}|)\Pi e>0\). So (4) provides a larger convergence domain than (10).

  • Compared to Zhang (2015), the extra assumption besides (5) is Assumption (II). In fact, for the commonly used multisplittings, Assumption (II) is easily satisfied. Consider the AOR multisplittings in the numerical examples of Zhang (2015):

    $$\begin{aligned} \left\{ \begin{array}{l} F_i^{(1)}=\frac{1}{\alpha }(D_{M}-\beta {L_{M_i}}),G_i^{(1)}=F_i^{(1)}-M, \\ F_i^{(2)}=\frac{1}{\alpha }(D_{M}-\beta {U_{M_i}}),G_i^{(2)}=F_i^{(2)}-M, \end{array} \right. \end{aligned}$$

    where \(-L_{M_i}\) and \(-U_{M_i}\) are the strictly lower-triangular and the strictly upper-triangular parts of \(D_M-E_iME_i\), respectively,

    $$\begin{aligned} E_i=\textrm{Diag}(0,0,\ldots ,I_{t_i},0,0,\ldots ,0)\in {\mathbb {R}}^{n\times n}, \end{aligned}$$

    the size \(t_i=\phi _q+1\) if \(i\le \phi _r\), and \(t_i=\phi _q\) otherwise, with \(\phi _q\) and \(\phi _r\) being two nonnegative integers satisfying \(n=\phi _q\ell +\phi _r\) and \(0\le \phi _r<\ell \). This type of multisplitting distributes the computational workload almost evenly across the \(\ell \) processors; see Bai and Zhang (2013b), Zhang (2015) for more details. By simple computation, we have

    $$\begin{aligned} \langle F_i^{(1)}\rangle -|G_i^{(1)}|=\langle F_i^{(2)}\rangle -|G_i^{(2)}|=\frac{1-|1-\alpha |}{\alpha }D_M-|L_M+U_M|. \end{aligned}$$

    Then, one can verify that, if

    $$\begin{aligned} 0<\beta \le \alpha <\frac{2}{1+\rho (D_M^{-1}|L_M+U_M|)}, \end{aligned}$$
    (11)

    \(\Pi \) given by

    $$\begin{aligned} \Pi =\textrm{diag}\Big {[}\big {(}\frac{1-|1-\alpha |}{\alpha }D_M-|L_M+U_M|\big {)}^{-1}e\Big {]} \end{aligned}$$
    (12)

    satisfies Assumption (II) of Theorem 6; the sketch below reproduces this construction.
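    The following small NumPy sketch (ours, with helper names of our own choosing) builds the AOR splitting pair for one worker and the matrix \(\Pi \) of (12):

        import numpy as np

        def aor_splitting_pair(M, Ei, alpha, beta):
            # F_i^{(1)} = (1/alpha)(D_M - beta*L_{M_i}), G_i^{(1)} = F_i^{(1)} - M,
            # and analogously for F_i^{(2)}, where -L_{M_i}, -U_{M_i} are the
            # strict triangular parts of D_M - Ei*M*Ei
            D = np.diag(np.diag(M))
            Mi = D - Ei @ M @ Ei
            Li, Ui = -np.tril(Mi, -1), -np.triu(Mi, 1)
            F1 = (D - beta * Li) / alpha
            F2 = (D - beta * Ui) / alpha
            return F1, F1 - M, F2, F2 - M

        def pi_of_12(M, alpha):
            # Pi = diag([((1-|1-alpha|)/alpha) D_M - |L_M+U_M|]^{-1} e), cf. (12);
            # note that |L_M + U_M| = |M - D_M|
            D = np.diag(np.diag(M))
            B = (1 - abs(1 - alpha)) / alpha * D - np.abs(M - D)
            return np.diag(np.linalg.solve(B, np.ones(M.shape[0])))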

Remark 2

In a recent work (Bai xxxx), a paradigm of two-step matrix splitting iteration methods for solving linear equations was constructed. If we take \(\ell =1\), Method 1 reduces to its serial version, whose idea is similar to a special case of the TMSI paradigm given in Bai (xxxx), with the mathematical problem changed to the LCP. It was shown in Bai (xxxx) that finding the theoretically optimal relaxation parameter is very difficult even in the case of linear equations. Note that in Theorem 6, the convergence range of \(\lambda ^{(k,i)}\) is given by (5) without a theoretical analysis of the optimal value.

4 Numerical examples

In this section, three examples are given to illustrate the efficiency of Method 2.

Example 1

(Dong and Jiang 2009) Consider the LCP arising from the finite difference discretization, on an equidistant grid, of a free boundary problem for the flow of water through a porous dam, where

$$\begin{aligned} M=\left( \begin{array}{cccccc} S &{} -I &{} &{} &{} &{} \\ -I &{} S &{} -I &{} &{} &{} \\ &{} -I &{} S &{} -I&{} &{} \\ &{} &{} \ddots &{} \ddots &{} \ddots &{} \\ &{} &{} &{} -I &{} S &{} -I \\ &{} &{} &{} &{} -I &{} S \end{array} \right) \in {{\mathbb {R}}^{{n}\times {n}}}, \end{aligned}$$

\(n=m^{2},\) \(S=\textrm{tridiag}({-1,4,-1})\in {{\mathbb {R}}^{m\times {m}}}\) and \(I\in {{\mathbb {R}}^{m\times {m}}}\) is the identity matrix.

Example 2

(Bai 2010) Consider the LCP, where

$$\begin{aligned} M=\left( \begin{array}{cccccc} S &{} -I &{} -I &{} &{}&{} \\ &{} S &{} -I &{} -I &{} &{} \\ &{} &{} \ddots &{} \ddots &{} \ddots &{} \\ &{} &{} &{} S &{} -I &{} -I \\ &{} &{} &{} &{} S &{} -I \\ &{} &{} &{} &{} &{} S \end{array} \right) \in {{\mathbb {R}}^{{n}\times {n}}}, \end{aligned}$$

where \(n=m^{2}\), \(S=\textrm{tridiag}({-1,4,-1})\in {{\mathbb {R}}^{m\times {m}}}\) and \(I\in {{\mathbb {R}}^{m\times {m}}}\) is the identity matrix.

Example 3

(Bai 2010) Consider the LCP, where

$$\begin{aligned} M=\left( \begin{array}{cccccc} T &{} -0.5I &{} &{} &{} &{} \\ -1.5I &{} T &{} -0.5I &{} &{} &{} \\ &{} -1.5I &{} T &{} -0.5I&{} &{} \\ &{} &{} \ddots &{} \ddots &{} \ddots &{} \\ &{} &{} &{} -1.5I &{} T &{} -0.5I \\ &{} &{} &{} &{} -1.5I &{} T \end{array} \right) \in {{\mathbb {R}}^{{n}\times {n}}}, \end{aligned}$$

where \(n=m^{2}\), \(T=\textrm{tridiag}({-1.5,4,-0.5})\in {{\mathbb {R}}^{m\times {m}}}\) and \(I\in {{\mathbb {R}}^{m\times {m}}}\) is the identity matrix.
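For reproducibility, the three test matrices can be assembled with Kronecker products. A compact NumPy sketch (ours; dense for clarity, although a practical implementation would use sparse storage):

    import numpy as np

    def tridiag(m, low, mid, up):
        return (np.diag(np.full(m - 1, low), -1) + np.diag(np.full(m, mid))
                + np.diag(np.full(m - 1, up), 1))

    def example_matrix(which, m):
        I = np.eye(m)
        up1 = np.diag(np.ones(m - 1), 1)    # first block superdiagonal pattern
        lo1 = np.diag(np.ones(m - 1), -1)   # first block subdiagonal pattern
        if which == 1:   # Example 1: blocks tridiag(-I, S, -I)
            return np.kron(I, tridiag(m, -1, 4, -1)) - np.kron(lo1 + up1, I)
        if which == 2:   # Example 2: S with -I on the two upper block diagonals
            up2 = np.diag(np.ones(m - 2), 2)
            return np.kron(I, tridiag(m, -1, 4, -1)) - np.kron(up1 + up2, I)
        # Example 3: blocks tridiag(-1.5I, T, -0.5I)
        return np.kron(I, tridiag(m, -1.5, 4, -0.5)) - np.kron(1.5 * lo1 + 0.5 * up1, I)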

The numerical tests are performed on a Dell PowerEdge R740, using one NVIDIA Tesla V100S-32GB GPU. The programming language is C, accelerated using OpenACC (NVIDIA 2023) with CUDA Driver Version 12, where the parallel levels are set to “worker=1”, “vector=1”, and “gang=\(2^0, 2^1, \ldots , 2^9\)”, respectively. With \(\ell \) workers, we use the notations \(T_\ell \), \({\mathcal {E}}_\ell \) and \(IT_\ell \) to represent the total computation time (in seconds), the parallel efficiency and the iteration steps, respectively, where \({\mathcal {E}}_\ell =T_1/(\ell T_\ell )\). Let \(x^{(0)}=e\) and \(\epsilon =10^{-6}\). With the SOR splitting, Method 2 is compared with Method 1, that is, the RSMSMSOR versus the SMSMSOR. The positive diagonal matrix \(\Omega \) is chosen as \(\Omega =D_M\) in both methods. For each case, the parameter \(\alpha \) used in the SOR splittings is chosen as the experimentally optimal one satisfying (11), while the positive diagonal matrix \(\Pi \) is always set to I, which satisfies Assumption (II) of Theorem 6.

The numerical results of the three examples are shown in Tables 1, 2 and 3, respectively. As Example 1 is an ill-conditioned problem that costs much computational time, its size is set to \(m=128\) and \(m=256\), while the sizes of the well-conditioned Examples 2 and 3 are set to \(m=1024\) and \(m=2048\). For the choice of the relaxation parameters, \(\lambda ^{(k,i)}\) is set to \(\lambda ^{(k,i)}\equiv \lambda \), where \(\lambda \) is the optimal parameter obtained experimentally: \(\lambda =1.9,1.2,1.5\) for Examples 1, 2 and 3, respectively.

In terms of iteration steps, for Examples 2 and 3, when \(\ell <128\) the iteration steps in each case are relatively stable and do not differ significantly, while when \(\ell \ge 128\) there is a significant increase in iteration steps, which is related to the number of workers exceeding the maximum parallel capacity of 80. On the other hand, for the ill-conditioned Example 1, the threshold for significant changes in iteration steps drops to \(\ell =32\). From the comparisons between the SMSMSOR and the RSMSMSOR, it can be seen that, with the relaxation technique, the RSMSMSOR requires significantly fewer iteration steps than the SMSMSOR. The specific comparisons of iteration steps are shown in Fig. 1.

In terms of computational time, as \(\ell \) increases, the computational time decreases due to the effects of parallel computing. From the comparisons between the SMSMSOR and the RSMSMSOR, it can be seen that the computational time of the RSMSMSOR is significantly less than that of the SMSMSOR. The percentages of time saved by the RSMSMSOR compared to the SMSMSOR are also shown in Tables 1, 2 and 3, where “SAVE” is defined by

$$\begin{aligned} SAVE=\frac{T_\ell ^{SMSMSOR}-T_\ell ^{RSMSMSOR}}{T_\ell ^{SMSMSOR}}\times 100\%. \end{aligned}$$

The specific comparisons of the computational time are shown in Fig. 2, which implies that the introduction of the relaxation technique has a significant acceleration effect on the SMSMSOR.

In terms of parallel efficiency, when \(\ell <128\), all the values of the parallel efficiency exceed 0.9, while when \(\ell \ge 128\), the lower bound of the parallel efficiency is still larger than 0.6, which is also a satisfactory result. This indicates that the numerical experiments based on the OpenACC framework fully utilize the hardware resources of the GPU device to achieve high parallel efficiency.

Table 1 Numerical results of Example 1
Table 2 Numerical results of Example 2
Table 3 Numerical results of Example 3
Fig. 1 The comparisons of the iteration steps

Fig. 2 The comparisons of the computational time

5 Conclusions

By introducing a relaxation technique, a relaxed two-step modulus-based synchronous multisplitting iteration method has been constructed for solving the LCP. We have also provided a convergence analysis and demonstrated the selection range of the relaxation parameter. Parallel numerical experiments based on the OpenACC framework have shown that the proposed method can effectively accelerate the existing methods; specifically, for the different examples, the computational time is reduced by between 17% and 48%. However, how to select the optimal relaxation parameter \(\lambda ^{(k,i)}\) is still a challenging problem, which is a meaningful direction for future research.