
1 Introduction

We consider parallel computational methods and technologies for solving very large non-symmetric sparse SLAEs with positive definite matrices

$$\begin{aligned} \begin{array}{l} Au=f, A=\{a_{l,m}\}\in \mathcal {R}^{N,N}, u=\{u_{l}\}, f=\{f_{l}\}\in \mathcal {R}^{N}, \\ (Av,v)\ge \delta ||v||^{2}, \delta > 0, (v,w)=\sum \limits ^{N}_{i=1}v_{i}w_{i}, ||v||^{2}=(v,v), \end{array} \end{aligned}$$
(1)

which arise in finite element or finite volume approximations of multi-dimensional boundary value problems (BVPs) on adaptive unstructured grids. Consider the PDE

$$\begin{aligned} \begin{array}{l} Lu({{\varvec{r}}})=f({{\varvec{r}}}), {{\varvec{r}}}\in \varOmega , \\ lu|_{\varGamma }=g({{\varvec{r}}}), {{\varvec{r}}}\in \varGamma , \bar{\varOmega }=\varOmega \bigcup \varGamma , \end{array} \end{aligned}$$
(2)

where L is a linear differential operator with piecewise smooth coefficients and l is a boundary condition operator which, in general, is of different type (Dirichlet, Neumann, or Robin) on different surface segments of \(\varGamma \). The computational domain \(\bar{\varOmega }=\varOmega \bigcup \varGamma \) may have a complicated geometry with multi-connected piecewise boundary surfaces and contrasting material properties in the subdomains. We suppose that the initial data of BVP (2) provide the existence of a unique solution \(u({{\varvec{r}}})\) smooth enough for the validity of the numerical methods to be applied.

In recent decades, a large literature on parallel domain decomposition methods has appeared; in the references we list only some books and papers [1,2,3,4]. The main DDM approaches in question include a balanced decomposition of the grid computational domain into parameterized overlapping or non-overlapping subdomains with different interface conditions on the internal boundaries. We also use two different structures for contacting neighbouring grid subdomains: with or without the definition of divider nodes (separators) forming a special (skeleton) grid subdomain. The first type of decomposition (with a skeleton grid subdomain) is usual for the FETI approach to DDM, see [1, 2], whereas in the second case the original matrix A has a more regular block-diagonal structure. The proposed parallel additive Schwarz algorithms are based on flexible multi-preconditioned semi-conjugate direction methods in block Krylov subspaces. The acceleration of the two-level iterative processes is provided by means of aggregation, or a coarse grid correction approach, with basis functions of different orders, which realizes a low-rank approximation of the original matrix. The auxiliary subsystems in the subdomains are solved by direct or by Krylov iterative methods. The parallel implementation of the algorithms is based on hybrid programming with MPI processes and multi-thread computing for the upper and lower levels of iterations, respectively.

This paper is organized as follows. In Sect. 2 we consider the geometric issues of the different types of domain decomposition. The next section describes the two-level iterative processes for solving SLAEs in Krylov subspaces, based on the multi-preconditioning approach proposed by C. Greif and his colleagues in [5,6,7]. Section 4 is devoted to the parallel implementation of the algorithms, which is realized in the framework of the library KRYLOV (Institute of Computational Mathematics and Mathematical Geophysics, SB RAS, Novosibirsk). The technical requirements for this code include the absence of formal program constraints on the number of degrees of freedom and on the number of processor units. In the conclusion we discuss the efficiency of the proposed methods and technologies, as well as the concept of creating a unified numerical environment for the fast solution of very large sparse SLAEs and high performance computing for parallel DDMs.

2 Geometric Issues of DDM

Domain decomposition approaches can be considered at the continuous level and at the discrete level. We use the second way and suppose that the original computational domain \(\varOmega \) has already been discretized into a grid computational domain \(\varOmega ^{h}\). So, in the following, DDM is applied to grid domains only, and the upper index “h” is omitted for brevity.

Let us decompose \(\varOmega \) into P subdomains (with or without overlapping):

$$\begin{aligned} \begin{array}{l} \varOmega =\bigcup \limits ^{P}_{q=1}\varOmega _{q}, \bar{\varOmega }_{q}=\varOmega _{q}\bigcup \varGamma _{q}, \varGamma _{q}=\bigcup \limits _{q^{\prime }\in \omega _{q}}\varGamma _{q,q^{\prime }}, \varGamma _{q,q^{\prime }}=\varGamma _{q}\bigcap \bar{\varOmega }_{q^{\prime }}, q^{\prime }\ne q. \end{array} \end{aligned}$$
(3)

Here \(\varGamma _{q}\) is the boundary of \(\varOmega _{q}\), composed of the segments \(\varGamma _{q,q^{\prime }}\), \(q^{\prime }\in \omega _{q}\), and \(\omega _{q}=\{q_{1},...,q_{M_{q}}\}\) is the set of \(M_{q}\) contacting, or conjugate, subdomains. Formally, we can also denote by \(\varOmega _{0}=R^{d}\backslash \varOmega \) the external subdomain:

$$\begin{aligned} \begin{array}{l} \bar{\varOmega }_{0}=\varOmega _{0}\bigcup \varGamma , \varGamma _{q,0}=\varGamma _{q}\bigcap \bar{\varOmega }_{0}=\varGamma _{q}\bigcap \varGamma , \varGamma _{q}=\varGamma ^{i}_{q}\bigcup \varGamma _{q,0}, \end{array} \end{aligned}$$
(4)

where \(\varGamma ^{i}_{q}=\bigcup \limits _{q^{\prime }\ne 0}\varGamma _{q,q^{\prime }}\) and \(\varGamma _{q,0}=\varGamma ^{e}_{q}\) denote the internal and external parts of the boundary of \(\varOmega _{q}\). We also define the overlapping \(\varDelta _{q,q^{\prime }}=\varOmega _{q}\bigcap \varOmega _{q^{\prime }}\) of neighbouring subdomains. If \(\varGamma _{q,q^{\prime }}=\varGamma _{q^{\prime },q}\) and \(\varDelta _{q,q^{\prime }}=\emptyset \), then the overlapping of \(\varOmega _{q}\) and \(\varOmega _{q^{\prime }}\) is empty. In particular, we suppose in (3) that each of the P subdomains has no intersection with \(\varOmega _{0}\) \((\varOmega _{q}\bigcap \varOmega _{0}=\emptyset )\).

The idea of DDM is to define a set of BVPs for all subdomains which together are equivalent to the original problem (2):

$$\begin{aligned} \begin{array}{l} Lu_{q}({{\varvec{r}}})=f_{q}, {{\varvec{r}}}\in \varOmega _{q}, l_{q,q^{\prime }}(u_{q})\big |_{\varGamma _{q,q^{\prime }}}=g_{q,q^{\prime }}\equiv l_{q^{\prime },q}(u_{q^{\prime }})\big |_{\varGamma _{q^{\prime },q}}, \\ q^{\prime }\in \omega _{q}, l_{q,0}u_{q}|_{\varGamma _{q,0}}=g_{q,0}, \; q=1,...,P. \end{array} \end{aligned}$$
(5)

At each segment of the internal boundaries of the subdomains, the interface conditions for the operators \(l_{q,q^{\prime }}\) from (5) are imposed in the form of a Robin boundary condition:

$$\begin{aligned} \begin{array}{l} \alpha _{q}u_{q}+\beta _{q} {\displaystyle \frac{\partial u_{q}}{\partial {{\varvec{n}}}_{q}}}\big |_{\varGamma _{q,q^{\prime }}}= \alpha _{q^{\prime }}u_{q^{\prime }}+\beta _{q^{\prime }} {\displaystyle \frac{\partial u_{q^{\prime }}}{\partial {{\varvec{n}}}_{q^{\prime }}}}\big |_{\varGamma _{q^{\prime },q}}, |\alpha _{q}|+|\beta _{q}|> 0, \quad \alpha _{q}\cdot \beta _{q}\ge 0. \end{array} \end{aligned}$$
(6)

Here \(\alpha _{q^{\prime }}=\alpha _{q}\), \(\beta _{q^{\prime }}=\beta _{q}\), and \({{\varvec{n}}}_{q}\) denotes the outer normal to the boundary segment \(\varGamma _{q,q^{\prime }}\) of the subdomain \(\varOmega _{q}\). Strictly speaking, at each part of the internal boundary \(\varGamma _{q,q^{\prime }}, q^{\prime }\ne 0\), a pair of different coefficient sets \(\alpha ^{(1)}_{q}, \beta ^{(1)}_{q}\) and \(\alpha ^{(2)}_{q}, \beta ^{(2)}_{q}\) for conditions of the type (6) should be given. For example, \(\alpha ^{(1)}_{q}=1\), \(\beta ^{(1)}_{q}=0\) and \(\alpha ^{(2)}_{q}=0\), \(\beta ^{(2)}_{q}=1\) correspond formally to the continuity of the sought solution and of its normal derivative, respectively. The additive Schwarz algorithm in DDM is based on an iterative process in which the BVPs in all q-th subdomains are solved simultaneously, and the right-hand sides of the boundary conditions in (5), (6) are taken from the previous iteration.

We implement domain decomposition in two steps. At the first step, we define subdomains \(\varOmega _{q}\) without overlapping, i.e. the contacting grid subdomains have no joint nodes, and each node belongs to one subdomain only. Then we define the grid boundary \(\varGamma _{q}=\varGamma ^{0}_{q}\) of \(\varOmega _{q}\), as well as the extensions \(\bar{\varOmega }^{t}_{q}=\varOmega ^{t}_{q}\bigcup \varGamma ^{t}_{q}, \; \varOmega ^{0}_{q}=\varOmega _{q}, \; t=0,...,\varDelta \), layer by layer:

$$\begin{aligned} \begin{array}{l} \varGamma _{q}\equiv \varGamma ^{0}_{q}=\{l^{\prime }\in \hat{\omega }_{l}: l\in \varOmega _{q}, l^{\prime }\notin \varOmega _{q}\}, \varOmega ^{1}_{q}=\bar{\varOmega }^{0}_{q}=\varOmega _{q}\bigcup \varGamma ^{0}_{q}, \\ \varGamma ^{t}_{q}=\{l^{\prime }\in \hat{\omega }_{l}: l\in \varOmega ^{t}_{q}, l^{\prime }\notin \varOmega ^{t}_{q}\}, \varOmega ^{t+1}_{q}=\bar{\varOmega }^{t}_{q}=\varOmega ^{t}_{q}\bigcup \varGamma ^{t}_{q}. \end{array} \end{aligned}$$
(7)

Here \(\varDelta \) is the extension, or overlapping, parameter.
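The layer-by-layer extension (7) amounts to repeatedly adding the graph neighbours of the current node set. A minimal sketch in Python (the function name and the dictionary-based stencil are our illustrative assumptions, not part of the library):

```python
def extend_subdomain(adjacency, core_nodes, delta):
    """Extend a non-overlapping subdomain layer by layer, as in (7).

    adjacency  : dict node -> neighbours of the node (the grid stencil)
    core_nodes : nodes of the initial subdomain Omega_q (= Omega_q^0)
    delta      : overlapping parameter Delta (number of layers to add)
    Returns the node set of the extended subdomain Omega_q^Delta.
    """
    omega = set(core_nodes)
    for _ in range(delta):
        # Gamma^t: outside neighbours of the current subdomain nodes
        boundary = {m for l in omega for m in adjacency[l] if m not in omega}
        omega |= boundary  # Omega^{t+1} = Omega^t U Gamma^t
    return omega

# toy 1-D grid of 10 nodes; subdomain {0,...,4} extended by Delta = 2 layers
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
print(sorted(extend_subdomain(adj, {0, 1, 2, 3, 4}, 2)))  # [0, 1, 2, 3, 4, 5, 6]
```

With \(\varDelta =0\) the subdomains stay disjoint; each further layer widens the overlap by one stencil neighbourhood.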

In Fig. 1 we present an example of a 2D grid domain decomposition with a skeleton grid subdomain, whose divider nodes are denoted by crosses.

Fig. 1. Decomposition of a 2D domain with a skeleton grid subdomain

The second example is presented in Fig. 2, where we have a grid decomposition without node separators for the overlapping parameter \(\varDelta =3\).

The algebraic interpretation of DDM, after approximation of the BVPs (5), is described by the block version of SLAE (1):

$$\begin{aligned} \begin{array}{l} A_{q,q}u_{q}+\sum \limits _{r\in \hat{\omega }_{q}}A_{q,r}u_{r}=f_{q}, q=1,...,P, \end{array} \end{aligned}$$
(8)

where \(u_{q}, f_{q}\in \mathcal {R}^{N^{\varDelta }_{q}}\) are subvectors whose components belong to the corresponding subdomain \(\varOmega ^{\varDelta }_{q}\), and \(N^{\varDelta }_{q}\) is the number of nodes in \(\varOmega ^{\varDelta }_{q}\).

In the case of Fig. 1, if the skeleton subdomain is numbered last, the block matrix A has the following arrow-type structure:

$$ A=\{A_{q,r}\}=\left| \begin{array}{ccc|c} A_{1,1} &{} &{} 0 &{} A_{1,P+1} \\ &{} \ddots &{} &{} \vdots \\ 0 &{} &{} A_{P,P} &{} A_{P,P+1} \\ \hline A_{P+1,1} &{} \cdots &{} A_{P+1,P} &{} A_{P+1,P+1} \end{array}\right| . $$

In the second case (decomposition without divider nodes, Fig. 2), the matrix A has a more regular block-diagonal structure.

Fig. 2. Decomposition of the grid domain without dividing nodes

The implementation of the interface conditions between adjacent subdomains can be described as follows. Let the l-th node be a near-boundary node in subdomain \(\varOmega _{q}\), see Fig. 3. Then the corresponding equation can be written in the form

$$\begin{aligned} \begin{array}{l} (B_{q,q}u)_{l}\equiv (a_{l,l}+\theta _{l}\sum \limits _{m\notin \omega _{q}}a_{l,m})u_{l}+{\displaystyle \sum \limits _{m\in \omega _{q}, m\ne l}}a_{l,m}u_{m}=f_{l}+{\displaystyle \sum \limits _{m\notin \omega _{q}}}a_{l,m}(\theta _{l}u_{l}-u_{m}). \end{array} \end{aligned}$$
(9)

Here \(\theta _{l}\) is a parameter which corresponds to the type of boundary condition at the boundary \(\varGamma _{q}\): \(\theta _{l}=0\) corresponds to the Dirichlet condition, \(\theta _{l}=1\) to the Neumann condition, and \(\theta _{l}\in (0, 1)\) to the Robin boundary condition. The diagonal blocks \(B_{q,q}\) define the block-diagonal matrix \(B_{s}=\text{ block-diag } \{B_{q,q}\}\) of the additive Schwarz (AS) iterative process. In the implementation of AS, the right-hand side of (9) is taken from the previous iteration.
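For a single near-boundary node, the modified equation (9) can be sketched as follows (dense row storage and the helper name are illustrative assumptions; the library works with sparse rows):

```python
import numpy as np

def schwarz_row(a_row, l, inside, f_l, u_prev, theta):
    """Build the near-boundary equation (9) for node l of subdomain Omega_q.

    a_row  : dense row {a_{l,m}} of A (toy storage)
    inside : boolean mask of the nodes belonging to Omega_q
    u_prev : previous Schwarz iterate, supplying u_l and u_m on the rhs
    theta  : 0 - Dirichlet, 1 - Neumann, 0 < theta < 1 - Robin interface
    Returns the modified coefficients over Omega_q and the modified rhs.
    """
    outside = ~inside
    s = a_row[outside].sum()            # sum of a_{l,m} over m not in Omega_q
    coeffs = a_row[inside].copy()
    pos = int(inside[:l].sum())         # position of node l among Omega_q unknowns
    coeffs[pos] += theta * s            # a_{l,l} + theta_l * sum
    rhs = f_l + theta * s * u_prev[l] - a_row[outside] @ u_prev[outside]
    return coeffs, rhs

# consistency check: with u_prev equal to the exact solution, (9) must hold
A = np.diag(np.full(5, 2.0)) - np.diag(np.ones(4), 1) - np.diag(np.ones(4), -1)
u = np.arange(5.0)
f = A @ u
inside = np.array([True, True, True, False, False])   # Omega_q = {0, 1, 2}
coeffs, rhs = schwarz_row(A[2], 2, inside, f[2], u, theta=0.5)
print(np.isclose(coeffs @ u[inside], rhs))  # True
```

The same identity holds for any \(\theta _{l}\), since the term \(\theta _{l}\sum a_{l,m}u_{l}\) is simply added to both sides of the original equation.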

Fig. 3. The grid stencil for a near-boundary node

The additive Schwarz iterative algorithm is defined by the corresponding preconditioning matrix \(B_{AS}\), which can be described as follows. For the subdomain \(\varOmega ^{\varDelta }_{q}\) with overlapping parameter \(\varDelta \) we define a prolongation matrix \(R^{T}_{q,\varDelta }\in \mathcal {R}^{N,N^{\varDelta }_{q}}\) which extends vectors \(u_{q}=\{u_{l},\;l\in \varOmega ^{\varDelta }_{q}\}\in \mathcal {R}^{N^{\varDelta }_{q}}\) to \(\mathcal {R}^{N}\) by the relations

$$ (R^{T}_{q,\varDelta }u_{q})_{l}=\left\{ \begin{array}{cc} (u_{q})_{l} &{} \text{ if } l\in \varOmega ^{\varDelta }_{q}, \\ 0 &{} \text{ otherwise }. \end{array}\right. $$

The transpose of this matrix defines a restriction operator which restricts vectors in \(\mathcal {R}^{N}\) to the subdomain \(\varOmega ^{\varDelta }_{q}\). The diagonal block of the preconditioning matrix \(B_{AS}\), which presents the restriction of the discretized BVP to the q-th subdomain, is given by \(\hat{A}_{q}=R_{q,\varDelta }AR^{T}_{q,\varDelta }\). In these terms, the additive Schwarz preconditioner is defined by

$$ B^{-1}_{AS}=\sum \limits ^{P}_{q=1}R^{T}_{q,\varDelta }\hat{A}^{-1}_{q}R_{q,\varDelta }. $$

Also, we define the so-called restricted additive Schwarz preconditioner by taking the prolongation \(R^{T}_{q,0}\) instead of \(R^{T}_{q,\varDelta }\), i.e.

$$ B^{-1}_{RAS}=\sum \limits ^{P}_{q=1}R^{T}_{q,0}\hat{A}^{-1}_{q}R_{q,\varDelta }. $$

Let us remark that \(B_{RAS}\) is a non-symmetric matrix even if A is symmetric.
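In exact arithmetic, the action of the AS preconditioner (and of its restricted variant) is a sum of independent local subdomain solves. A dense toy sketch (the function name and the index-array bookkeeping are our assumptions, not the library interface):

```python
import numpy as np

def apply_schwarz(A, subdomains, r, restricted=False):
    """Apply the AS (or RAS) preconditioner to a residual r:
    the sum over q of R^T_q hat(A)_q^{-1} R_{q,Delta} r.

    subdomains : list of (overlap_idx, core_idx) integer index arrays;
                 core_idx is the non-overlapping part, used by the RAS prolongation
    """
    z = np.zeros_like(r)
    for overlap, core in subdomains:
        A_q = A[np.ix_(overlap, overlap)]      # hat(A)_q = R A R^T
        y = np.linalg.solve(A_q, r[overlap])   # local solve hat(A)_q^{-1} R r
        if restricted:
            mask = np.isin(overlap, core)      # RAS: prolong core nodes only
            z[overlap[mask]] += y[mask]
        else:
            z[overlap] += y                    # AS: full prolongation
    return z

# with decoupled diagonal blocks and overlap = core, the result is exactly A^{-1} r
A = np.zeros((4, 4))
A[:2, :2] = [[4.0, 1.0], [1.0, 3.0]]
A[2:, 2:] = [[5.0, 2.0], [2.0, 4.0]]
subs = [(np.arange(2), np.arange(2)), (np.arange(2, 4), np.arange(2, 4))]
r = np.array([1.0, 2.0, 3.0, 4.0])
print(np.allclose(A @ apply_schwarz(A, subs, r), r))  # True
```

With genuine overlap, AS adds the overlapping contributions twice, while RAS keeps a single value per node, which is exactly why \(B_{RAS}\) loses symmetry.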

The second preconditioning matrix which we use for the DDM iterations in Krylov subspaces is responsible for the coarse grid correction, or aggregation approach, based on a low-rank approximation of the original matrix A. We define a coarse grid, or macrogrid, \(\varOmega _{c}\) and a corresponding coarse space with the number of degrees of freedom \(N_{c}\ll N\), as well as some basis functions \(w^{k}\in \mathcal {R}^{N}, \; k=1,...,N_{c}\). We suppose that the rectangular matrix \(W=(w^{1}...w^{N_{c}})\in \mathcal {R}^{N,N_{c}}\) has full rank and define the coarse grid preconditioner \(B_{c}\) as follows:

$$ B^{-1}_{c}=W\hat{A}^{-1}W^{T}, \hat{A}=W^{T}AW\in \mathcal {R}^{N_{c},N_{c}}, $$

where the small matrix \(\hat{A}\) is a low-rank approximation of A; \(W^{T}\) is called the restriction matrix and \(W\) the prolongation matrix.

In some papers, see [4, 8], the aggregation, or deflation, technique is applied to improve the initial guess for Krylov iterative methods. Let \(u^{-1}\) be an arbitrary vector. Then we can compute the initial vectors \(u^{0}\) and \(r^{0}\) by the formulas

$$\begin{aligned} \begin{array}{l} u^{0}=u^{-1}+B^{-1}_{c}r^{-1}, r^{-1}=f-Au^{-1}, \\ r^{0}=f-Au^{0}, p^{0}=r^{0}-B^{-1}_{c}Ar^{0}, \end{array} \end{aligned}$$
(10)

which provide the orthogonal properties

$$\begin{aligned} \begin{array}{l} W^{T}r^{0}=0, W^{T}Ap^{0}=0, \end{array} \end{aligned}$$
(11)

where \(p^{0}\) is the conventional initial direction vector in Krylov methods.
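Formulas (10) and the orthogonality checks (11) fit in a few lines of numpy; this is a toy dense sketch under our naming assumptions:

```python
import numpy as np

def deflated_start(A, W, f, u_m1):
    """Deflated initial guess, formulas (10): returns u0, r0, p0
    with W^T r0 = 0 and W^T A p0 = 0, cf. (11)."""
    A_hat = W.T @ A @ W                            # hat(A) = W^T A W

    def Bc_inv(v):                                 # B_c^{-1} v = W hat(A)^{-1} W^T v
        return W @ np.linalg.solve(A_hat, W.T @ v)

    r_m1 = f - A @ u_m1
    u0 = u_m1 + Bc_inv(r_m1)
    r0 = f - A @ u0
    p0 = r0 - Bc_inv(A @ r0)
    return u0, r0, p0

rng = np.random.default_rng(7)
N, Nc = 6, 2
A = rng.standard_normal((N, N)) + N * np.eye(N)    # nonsingular test matrix
W = np.zeros((N, Nc))
W[:3, 0] = W[3:, 1] = 1.0                          # piecewise-constant coarse basis
f = rng.standard_normal(N)
u0, r0, p0 = deflated_start(A, W, f, np.zeros(N))
print(np.allclose(W.T @ r0, 0), np.allclose(W.T @ (A @ p0), 0))  # True True
```

The coarse solve is performed once per application of \(B^{-1}_{c}\); in the library setting \(\hat{A}\) would be factorized once and reused.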

It is possible to choose the basis functions \(w^{k}({{\varvec{r}}})\) from the approximation principle. If the solution u of the original problem is smooth enough, we can write

$$ u=\big \{u_{l}\approx u^{c}_{l}=\sum \limits ^{N_{c}}_{k=1}c_{k}w^{k}({{\varvec{r}}}_{l})\big \}\cong W\hat{u}, $$

where the vector \(\hat{u}=\{c_{k}\}\in \mathcal {R}^{N_{c}}\) consists of the coefficients of the expansion. If we substitute this representation into (1), we obtain the approximate equation \(AW\hat{u}\approx f\). Multiplying both sides of this equation by \(W^{T}\), we have

$$\begin{aligned} \begin{array}{l} W^{T}AW\hat{u}\approx W^{T}f, \hat{u}\cong \hat{A}^{-1}W^{T}f, \hat{A}=W^{T}AW, \\ u\approx W\hat{u}\approx B^{-1}_{c}f, B^{-1}_{c}=W\hat{A}^{-1}W^{T}. \end{array} \end{aligned}$$
(12)

It is natural to take the basis functions \(w^{k}({{\varvec{r}}})\) as finite interpolation functions of different orders. In the simplest case the functions \(w^{k}({{\varvec{r}}}), \; k=1,...,N_{c}=P\), are chosen equal to unity in the k-th subdomain and to zero in the other subdomains. It is important that, in general, the coarse grid \(\varOmega _{c}\) does not depend on the domain decomposition.
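For the simplest piecewise-constant choice just described, W and the action of the coarse correction can be sketched as follows (the function names are illustrative):

```python
import numpy as np

def coarse_basis(part, P):
    """Piecewise-constant basis functions: w^k = 1 in subdomain k, 0 elsewhere.

    part : length-N array, part[l] = index (0..P-1) of the subdomain of node l
    """
    W = np.zeros((len(part), P))
    W[np.arange(len(part)), part] = 1.0
    return W

def coarse_correct(A, W, v):
    """B_c^{-1} v = W (W^T A W)^{-1} W^T v, the coarse grid correction of (12)."""
    A_hat = W.T @ A @ W
    return W @ np.linalg.solve(A_hat, W.T @ v)

# if the solution lies in the coarse space, the correction recovers it exactly
rng = np.random.default_rng(1)
G = rng.standard_normal((6, 6))
A = G @ G.T + 6 * np.eye(6)                        # SPD test matrix
W = coarse_basis(np.array([0, 0, 0, 1, 1, 1]), 2)
u = W @ np.array([2.0, -1.0])                      # u = W u_hat
print(np.allclose(coarse_correct(A, W, A @ u), u))  # True
```

Higher-order variants replace the 0/1 columns of W by interpolation weights on the macrogrid \(\varOmega _{c}\); the correction formula is unchanged.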

Let us remark that instead of the deflation approach (12), which is based on multiplying both sides of the original SLAE by \(W^{T}\), we can multiply by \(W^{T}A^{T}\). In this case we obtain

$$ \begin{array}{l} W^{T}A^{T}AW\check{u}\approx W^{T}A^{T}f, \check{u}=\check{A}^{-1}W^{T}A^{T}f, \\ u\approx W\check{u}\approx \check{B}^{-1}_{c}f, \check{A}=W^{T}A^{T}AW, \check{B}^{-1}_{c}=W\check{A}^{-1}W^{T}, \end{array} $$

and application of formulas (10) with \(\check{B}_{c}\) instead of \(B_{c}\) provides an initial guess with another kind of orthogonal properties:

$$ W^{T}A^{T}r^{0}=0, W^{T}A^{T}Ap^{0}=0. $$

3 Multi-preconditioned SCR

Now we present a general description of the multi-preconditioned semi-conjugate residual iterative method. Let \(r^{0}=f-Au^{0}\) be the initial residual of the algebraic system, and let \(B^{(1)}_{0},...,B^{(m_{0})}_{0}\) be a set of non-singular, easily invertible preconditioning matrices. Using them, we define a rectangular matrix composed of the initial direction vectors \(p^{0}_{l},\, l=1,...,m_{0}\):

$$\begin{aligned} \begin{array}{l} P_{0}=[p^{0}_{1}\cdots p^{0}_{m_{0}}]\in \mathcal {R}^{N,m_{0}}, p^{0}_{l}=(B^{(l)}_{0})^{-1}r^{0}, \end{array} \end{aligned}$$
(13)

which are assumed to be linearly independent.

The successive approximations \(u^{n}\) and the corresponding residuals \(r^{n}\) are sought with the help of the recursions

$$\begin{aligned} \begin{array}{l} u^{n+1}=u^{n}+P_{n}\bar{\alpha }_{n}=u^{0}+P_{0}\bar{\alpha }_{0}+...+P_{n}\bar{\alpha }_{n}, \\ r^{n+1}=r^{n}-AP_{n}\bar{\alpha }_{n}=r^{0}-AP_{0}\bar{\alpha }_{0}-...-AP_{n}\bar{\alpha }_{n}. \end{array} \end{aligned}$$
(14)

Here \(\bar{\alpha }_{n}=(\alpha ^{1}_{n},...,\alpha ^{m_{n}}_{n})^{T}\) are \(m_{n}\)-dimensional vectors. The direction vectors \(p^{n}_{l},\; l=1,...,m_{n}\), forming the columns of the rectangular matrices \(P_{n}=[p^{n}_{1}\cdots p^{n}_{m_{n}}]\in \mathcal {R}^{N,m_{n}}\), are defined as orthogonal in the sense of the relations

$$\begin{aligned} \begin{array}{l} P^{T}_{n}A^{T}AP_{k}=D_{n,k}=0 \text{ for } k\ne n, \end{array} \end{aligned}$$
(15)

where \(D_{n,n}=diag\{\rho _{n,l}\}\) is a symmetric positive definite matrix, because the matrices \(P_{k}\) are supposed to have full rank.

The orthogonality properties (15) provide the minimization of the residual norm \(||r^{n+1}||_{2}\) in the block Krylov subspace of dimension \(M_{n}\):

$$\begin{aligned} \begin{array}{l} K_{M_{n}}=Span\{P_{0},...,A^{n-1}P_{n-1}\}, M_{n}=\sum \limits ^{n-1}_{k=0}m_{k}, \end{array} \end{aligned}$$
(16)

if we define the coefficient vectors \(\bar{\alpha }_{n}\) by the formula

$$\begin{aligned} \begin{array}{l} \bar{\alpha }_{n}=\{\alpha _{n,l}\}=D^{-1}_{n,n}P^{T}_{n}A^{T}r^{0}. \end{array} \end{aligned}$$
(17)

For such values of \(\bar{\alpha }_{n}\) it is easy to check that the vectors \(p^{n}_{k},\; r^{n}_{k}\) satisfy the semi-conjugation condition, i.e.

$$\begin{aligned} \begin{array}{l} P^{T}_{k}A^{T}r^{n+1}=0, k=0,1,...,n. \end{array} \end{aligned}$$
(18)

In this case, the following relations are valid for the functionals of the residuals:

$$\begin{aligned} \begin{array}{l} ||r^{n+1}||^{2}\equiv (r^{n+1}, r^{n+1})=(r^{n}, r^{n})-(C_{n}r^{0}, r^{0})= \\ (r^{0},r^{0})-(C_{0}r^{0},r^{0})-...-(C_{n}r^{0}, r^{0}), C_{n}=AP_{n}D^{-1}_{n,n}P^{T}_{n}A^{T}. \end{array} \end{aligned}$$
(19)

Let us remark that such properties are valid for any direction vectors \(p^{n}_{k}\) which satisfy the orthogonality condition (15). We look for the matrices composed of the direction vectors via the recurrent relations

$$\begin{aligned} \begin{array}{l} P_{n+1}=Q_{n+1}-\sum \limits ^{n}_{k=0}P_{k}\bar{\beta }_{k,n}, \end{array} \end{aligned}$$
(20)

where the auxiliary matrices

$$\begin{aligned} \begin{array}{l} Q_{n+1}=[q^{n+1}_{1}...\; q^{n+1}_{m_{n+1}}], q^{n+1}_{l}=(B^{(l)}_{n+1})^{-1}r^{n+1},l=1,...,m_{n+1}, \end{array} \end{aligned}$$
(21)

are introduced, \(B^{(l)}_{n+1}\) are some non-singular, easily invertible preconditioning matrices, and \(\bar{\beta }_{k,n}\) are the coefficient vectors defined, after substitution of (20) into the orthogonality conditions (15), by the formula

$$\begin{aligned} \begin{array}{l} \bar{\beta }_{k,n}=D^{-1}_{k,k}P^{T}_{k}A^{T}AQ_{n+1}. \end{array} \end{aligned}$$
(22)

The following statement is valid.

Theorem 1

Let the matrices A and \(B^{(l)}_{n}, \; n=0,1,...; \; l=1,...,m_{n}\), be nonsingular, and let the matrices \(P_{n}\) have full rank. Then the iterative process (14), (17), (20)–(22) provides the minimization of the residual norm \(||r^{n}||\) in the block Krylov subspaces (16). Moreover, the following semi-orthogonality properties of the residual vectors are valid:

$$\begin{aligned} \begin{array}{l} (A^{\gamma }B^{-1}_{k,l}r^{n},r^{k})=\left\{ \begin{array}{lr} 0, \quad k< n, \\ \sigma ^{(\gamma )}_{n}=(A^{\gamma }B^{-1}_{n,l}r^{n},r^{n}), \quad k=n. \end{array}\right. \end{array} \end{aligned}$$
(23)

Also, instead of (17), the coefficients \(\alpha _{n,l}\) can be computed by the formula

$$\begin{aligned} \begin{array}{l} \alpha ^{(\gamma )}_{n,l}=(A^{\gamma }B^{-1}_{n,l}r^{n},r^{n})/\rho _{n,l}. \end{array} \end{aligned}$$
(24)
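The whole process (13), (14), (17), (20)–(22) can be condensed into a short dense sketch. This is our illustrative reading, not the library code: the diagonal matrix \(D_{n,n}\) of (15) is replaced by the full local Gram matrix \(P^{T}_{n}A^{T}AP_{n}\), inverted via a pseudo-inverse for robustness, which yields the same residual minimization:

```python
import numpy as np

def mpscr(A, f, precs, u0, n_iter):
    """Multi-preconditioned semi-conjugate residual sketch, no restarts.

    precs : list of callables v -> B_l^{-1} v (the preconditioners)
    """
    u, r = u0.copy(), f - A @ u0
    blocks = []                                        # stored (P_k, A P_k, D_k^+)
    for _ in range(n_iter):
        Q = np.column_stack([B(r) for B in precs])     # (13) / (21)
        for P_k, AP_k, Dk_pinv in blocks:              # orthogonalize, (20), (22)
            Q = Q - P_k @ (Dk_pinv @ (AP_k.T @ (A @ Q)))
        AQ = A @ Q
        D_pinv = np.linalg.pinv(AQ.T @ AQ)             # Gram matrix of (15)
        alpha = D_pinv @ (AQ.T @ r)                    # (17)
        u, r = u + Q @ alpha, r - AQ @ alpha           # (14)
        blocks.append((Q, AQ, D_pinv))
    return u, r

# toy non-symmetric system with Jacobi and identity preconditioners
A = np.diag(np.arange(2.0, 8.0)) + 0.3 * np.triu(np.ones((6, 6)), 1)
f = np.ones(6)
precs = [lambda v: v / np.diag(A), lambda v: v]
u, r = mpscr(A, f, precs, np.zeros(6), n_iter=6)
print(np.linalg.norm(r))
```

Since the identity preconditioner reproduces the classical GCR directions, the block method minimizes the residual over a subspace at least as large, and the toy run converges to machine accuracy.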

The presented MPSCR method uses a dynamic, or flexible, definition of the preconditioners \(B^{(l)}_{n}\); moreover, their number \(m_{n}\) may vary between iteration steps. We propose to use "totally flexible" variants of the "coarse" preconditioning matrices \(B_{c}\) and the Jacobi-Schwarz ones \(B_{s}\). It means that at each n-th iteration we can apply several different coarse grids \(\varOmega ^{l^{\prime }}_{c,n}, \; l^{\prime }=0,1,...,m^{c}_{n}\), of the corresponding dimensions \(N^{l^{\prime }}_{c,n}\), and a different number of Schwarz-type preconditioners \(B^{l^{\prime \prime }}_{s,n}, \; l^{\prime \prime }=0,1,...,m^{s}_{n}\) (\(m^{s}_{n}=0 \text{ or } m^{c}_{n}=0\) means that the corresponding preconditioning is not used at the current iteration). Let us note that the application of several Schwarz preconditioners at each iteration corresponds to the weighted version of domain decomposition proposed by Greif in [7], and using several aggregation approaches at one iterative step can be interpreted, in a sense, as an additive multi-grid technique. Simplified versions of SCR were considered in [9]–[11]; in [10] it is called the Generalized Conjugate Residual (GCR) method. In a sense, the SCR method is an alternative to the well-known GMRES, see [12].

A disadvantage of the considered algorithm, as of other Krylov methods for solving non-symmetric SLAEs, is the necessity of saving all the direction vectors, which requires a lot of memory for a large number of iterations. There are two main ways to avoid this. The first one consists in periodic restarts: at each iteration with number \(n_{r}=r\cdot n_{rest}, \; r=1,2,...,\) the current residual vector is computed not by the recursion (14) but from the equation \(r^{n_{r}}=f-Au^{n_{r}}\), and the next values \(u^{n+1}, r^{n+1}, P_{n+1}\) are computed recursively again from the beginning. In the second way, limited orthogonalization is used: only the \(n_{lim}\) last direction matrices \(P_{n}, P_{n-1},...,P_{n-n_{lim}}\) are saved and used in formula (20). Also, for the case when the dimensions of the Krylov subspaces are too large, it was proposed in [7] to use selective multi-preconditioning, in which some preconditioners and the corresponding direction vectors are omitted at some steps of the iterative process.

Let us also remark that in the overlapping DDM the vectors \(u^{n}\) are overdetermined in the intersections of the subdomains, and in this case we use the restricted additive Schwarz approach with the preconditioner \(B_{RAS}\).

4 Parallel Algorithms and Technologies

The presented principles of constructing the algorithms are implemented in the library Krylov [12] (Institute of Computational Mathematics and Mathematical Geophysics, SB RAS, Novosibirsk) for the efficient parallel solution of large SLAEs with sparse matrices, which are stored in the compressed sparse row format (CSR, [13]); converters from other conventional formats are available. The key approach is the automatic construction of a balanced partitioning of the weighted oriented graph that presents the structure of the grid set of equations, see Fig. 4. The synchronization of the distributed computing in DDMs is provided by the MPI processes, which are implemented for the corresponding subdomains on multi-core CPUs with shared memory.
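For reference, CSR storage keeps three arrays per matrix; a minimal matrix-vector product sketch (a plain illustration, not the library's optimized SPARSE BLAS kernel):

```python
import numpy as np

def csr_matvec(vals, cols, ptr, x):
    """y = A x for a matrix in compressed sparse row (CSR) storage:
    vals - nonzero values, stored row by row
    cols - column index of each stored value
    ptr  - ptr[i]:ptr[i+1] delimits row i inside vals/cols
    """
    y = np.zeros(len(ptr) - 1)
    for i in range(len(y)):
        lo, hi = ptr[i], ptr[i + 1]
        y[i] = vals[lo:hi] @ x[cols[lo:hi]]
    return y

# 1-D Laplacian [[2,-1,0],[-1,2,-1],[0,-1,2]] in CSR
vals = np.array([2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0])
cols = np.array([0, 1, 0, 1, 2, 1, 2])
ptr = np.array([0, 2, 5, 7])
print(csr_matvec(vals, cols, ptr, np.array([1.0, 1.0, 1.0])))  # [1. 0. 1.]
```

In the DDM setting, each MPI process holds the CSR rows of its own subdomain, so the row-wise loop parallelizes naturally.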

The main requirements to develop a proper software are the following:

  • no formal program restrictions on the number of degrees of freedom of the algebraic systems to be solved or on the number of processor units and computational cores used; in other words, the numerical environment should provide scalable parallelism in the weak and/or the strong sense;

  • application of advanced iterative methods, with the possibility of extending the library functionality;

  • robust implementation based on hybrid programming of the two-level computational process: MPI tools for the outer Krylov iterations and multi-thread techniques for solving the algebraic subsystems in the subdomains;

  • high performance computing by means of efficient SPARSE BLAS tools [13] and communication optimization.

Fig. 4. An example of an oriented weighted graph

The library tools include the automatic construction of a balanced domain decomposition and support varying the number of subdomains, the size of the overlapping, the type of interface conditions, and the choice of preconditioners and Krylov algorithms, etc. The current version of the library does not yet use multi-GPU computer configurations, and the corresponding development is in progress.

5 Conclusion

At present, there are many well-known libraries and packages of algebraic iterative solvers for large sparse SLAEs: HYPRE, PETSc, SPARSKIT, and others, which are available on the Internet. The goal of creating the new library Krylov consists in the development of an extendable, efficient environment for the scalable parallel solution of various types of grid algebraic systems (real and complex, symmetric and non-symmetric, positive definite and indefinite, Hermitian and non-Hermitian, etc.) by advanced DDM approaches, by means of hybrid programming on heterogeneous multi-CPU, multi-core, and multi-GPU computers. The program implementation is organized as Open Source, adapted to the evolution of computer architectures and platforms. The library has a two-fold destination. The first consists in providing numerical tools for the automatic construction of algorithms and for the fast development, validation, verification, testing, and comparative analysis of new methods. The second aim includes the development of high performance code and a friendly interface for end users. In principle, this is not a group project; it is oriented to wide cooperation within the computational algebra community.