1 INTRODUCTION

Preconditioned iterative methods in Krylov subspaces for solving systems of linear algebraic equations (SLAEs) are an important achievement of computational mathematics in the 20th century. Their significance has become especially high with the development of supercomputer simulation and the solution of interdisciplinary direct and inverse, nonlinear, and nonstationary multidimensional problems with complicated geometric and material properties. Modern requirements for the accuracy of solving problems with real-life data lead to superlarge systems (\({{10}^{{10}}}\)–\({{10}^{{11}}}\) unknowns and more) and huge condition numbers (\({{10}^{{14}}}\) and higher), at which standard double precision computations are barely adequate. In this case, the computational cost of solving SLAEs exceeds 80% of the total computational cost of large-scale computer simulations.

Of most interest to researchers are algebraic systems with sparse matrices that arise from the approximation of boundary value problems using finite element and finite volume methods and discontinuous Galerkin algorithms of various orders on unstructured grids (see the review in [1]). In these cases, the portraits or graphs of the matrices have an irregular structure, i.e., the distribution of nonzero elements is specified only by their enumeration, and their values are stored in compressed formats, which is an important feature in the implementation of algorithms on modern supercomputers with heterogeneous structure and hierarchical shared memory. The spectral and structural properties of SLAEs differ significantly between applications—electromagnetism, fluid dynamics, elasticity and plasticity, heat and mass transfer equations, etc.—which gives rise to a large variety of algorithms. Their development goes in the direction of both the study of new iterative processes with various orthogonal, variational, and projection properties and the construction of effective preconditioning matrices; a combination of these methodologies determines success in this area. The results of these studies are described in the books by Axelsson [2], Elman, Silvester, and Wathen [3], Marchuk and Kuznetsov [4], Saad [5], Van der Vorst [6], Olshanskii and Tyrtyshnikov [7], Liesen and Strakos [8], the author of this paper [9, 10], and in a huge number of journal publications and conference proceedings.

Speaking about the generalization of Krylov iterative methods, we note that there are various approaches to enriching the corresponding subspaces in which residual or direction vectors for finding new iterative approximations are determined on the basis of various orthogonal or projective principles, including deflation and augmentation algorithms in combination with spectral optimization or the least squares principle. It is of interest that there have been attempts at constructing “alternatives” to methods in Krylov subspaces, such as Anderson acceleration (which was initially proposed for solving nonlinear systems [11–13]) and Sonneveld subspaces [14]; however, in actual fact they turned out to be variations on the general theme.

The ever increasing stream of literature on methods for solving SLAEs contains a rich palette of ideas for constructing preconditioning matrices, which should be easily invertible, increase the convergence rate of iterative algorithms, and ultimately ensure their overall performance. Here multigrid methods, which are asymptotically optimal with respect to order (the amount of computation is proportional to the number of degrees of freedom of the discrete problem), have experienced a revival. The methodologies of matrix decomposition have also been significantly developed, including low-rank matrix approximations and nested and alternating triangular factorizations. Various approaches have formed a uniform technology for constructing multipreconditioned processes in Krylov subspaces, which are generally multilevel, and additive domain decomposition methods (DDMs), which are the main tool for parallelizing multidimensional problems on multiprocessors. Methods of scalable parallelization and of improving the performance of DDMs on supercomputers of heterogeneous architecture with distributed and hierarchical shared memory form the key problem in modern computational algebra [5, 16–19]. Note that the methods for solving SLAEs arising in implicit approximations of initial boundary value problems have some specific features (see the review in [20]). In particular, it was shown in [21] that the choice of initial iterative approximations at each time step using a generalization of the predictor–corrector algorithm can significantly improve the efficiency of computations.

The practical demand for algebraic computations has led, over its half-century history, to a huge amount of software, both free and commercial; a fairly complete list of such software can be found in [22], and a classification of algorithms is given in [23]. Here we pay attention to tools of general importance like SPARSE BLAS and widespread libraries such as PETSc, HYPRE, Intel MKL, etc. The international community of experts in computational algebra has developed standardized formats for representing matrices and corresponding converters, as well as large collections of matrices from real-life simulation problems that are indispensable for testing and comparative analysis of algorithms. An important trend in recent decades is the transition from concrete libraries and application program packages to integrated computational environments, examples of which are DUNE, INMOST, OpenFOAM, and MATLAB, and the basic simulation system whose concept is described in [24]. The latter includes principles of flexible extension of the composition of models and algorithms, adaptation to the evolution of computer architectures, reuse of external software, and coordinated participation of various development groups, which should facilitate the creation of an effective new-generation ecosystem with a long life cycle that joins the communities of mathematicians, software developers, and users from various application domains. On the whole, the high-performance solution of SLAEs for a wide class of practical problems is a promising field of activity for the intellectualization of both research and technological areas.

This paper is organized as follows. In Section 2, we describe the general modern principles of constructing iterative methods in Krylov subspaces, including two-level methods. Section 3 contains a brief review of approaches to constructing preconditioning matrices. Section 4 describes parallelization and performance-improving technologies for solving SLAEs, including issues of their software implementation and effective use for simulating real-life processes and phenomena. In the final section, we discuss the prospects of developing the issues under consideration.

2 ORTHOGONAL AND VARIATIONAL PRINCIPLES OF DESIGNING KRYLOV ITERATIVE METHODS

We consider real algebraic systems

$$Au = f,\quad A = {\text{\{}}{{a}_{{l,m}}}{\text{\}}} \in {{\mathcal{R}}^{{N,N}}},\quad u = {\text{\{}}{{u}_{l}}{\text{\}}},\quad f = {\text{\{}}{{f}_{l}}{\text{\}}} \in {{\mathcal{R}}^{N}},$$
(1)

with positive semi-definite matrices

$$(A{v},{v}) \geqslant \delta {{\left\| {v} \right\|}^{2}},\quad \delta \geqslant 0,\quad ({v},w) = \sum\limits_{i = 1}^N {{{{v}}_{i}}} {{w}_{i}},\quad {{\left\| {v} \right\|}^{2}} = ({v},{v}),$$

among which an important class consists of symmetric positive definite (s.p.d.) matrices with \(\delta > 0\). The SLAEs under examination can be represented in block form

$$\begin{gathered} {{A}_{{q,q}}}{{u}_{q}} + \sum\limits_{r \in \Omega q} {{{A}_{{q,r}}}} {{u}_{r}} = {{f}_{q}},\quad q = 1,\;2,\; \ldots ,\;P, \\ A = {\text{\{}}{{A}_{{q,r}}}{\text{\}}},\quad {{A}_{{q,r}}} \in {{\mathcal{R}}^{{{{N}_{q}},{{N}_{r}}}}},\quad {{f}_{q}} \in {{\mathcal{R}}^{{Nq}}},\quad {{N}_{1}} + \; \ldots \; + {{N}_{P}} = N, \\ \end{gathered} $$
(2)

where \({{\Omega }_{q}}\) is the set of indices of matrix rows forming the \(q\)th block row of the matrix \(A\) and \(P\) is its block order.

One important block structure is given by saddle-point algebraic systems

$$Au \equiv \left[ {\begin{array}{*{20}{c}} D&{{{C}^{{\text{T}}}}} \\ C&0 \end{array}} \right]\left[ {\begin{array}{*{20}{c}} {{{u}_{1}}} \\ {{{u}_{2}}} \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} {{{f}_{1}}} \\ 0 \end{array}} \right] = f,\quad D \in {{\mathcal{R}}^{{{{N}_{1}},{{N}_{1}}}}},\quad C \in {{\mathcal{R}}^{{{{N}_{2}},{{N}_{1}}}}};\quad {{u}_{1}},{{f}_{1}} \in {{\mathcal{R}}^{{{{N}_{1}}}}},$$
(3)

which arise from the approximation of mixed statements of boundary value problems and in optimization methods. The matrix \(A\) in (3) is invertible if the following conditions are fulfilled:

$${\text{rank}}(D) = {{N}_{1}},\quad {\text{rank}}(C) = {{N}_{2}},\quad \ker(D) \cap \ker(C) = {\text{\{}}0{\text{\}}},\quad {{N}_{2}} \leqslant {{N}_{1}}.$$

Along with the original matrix, we will consider preconditioning matrices, which we will denote by the symbol \(B\) with superscripts or subscripts. The choice of preconditioners is based on efficient invertibility and improvement of the conditioning of the preconditioned matrices of the form \({{B}^{{ - 1}}}A\) or \(A{{B}^{{ - 1}}}\), which together should improve the performance of the software implementation of the algorithm.

A general representation of the stationary iterative process for solving SLAE (1) is the so-called first canonical form

$${{u}^{{n + 1}}} = {{u}^{n}} + {{B}^{{ - 1}}}{{r}^{n}},\quad {{r}^{n}} = f - A{{u}^{n}},$$

where \({{r}^{n}}\) is the residual vector. It is associated with the second canonical form

$${{u}^{{n + 1}}} = T{{u}^{n}} + g,\quad T = I - {{B}^{{ - 1}}}A,\quad g = {{B}^{{ - 1}}}f,$$

where \(T\) is called the transition matrix. A criterion for stopping the iterative process is the fulfillment of the following conditions (for details see [25, 26])

$$\left\| {{{r}^{n}}} \right\| \leqslant {{\varepsilon }_{1}}\left\| f \right\| + {{\varepsilon }_{2}}\left\| A \right\|\left\| {{{u}^{n}}} \right\|,\quad 0 < {{\varepsilon }_{1}},\;{{\varepsilon }_{2}} \ll 1;$$

however, it is often assumed that \({{\varepsilon }_{2}} = 0\) for simplicity. The issue of estimating the iterative approximation error \(\left\| {u - {{u}^{n}}} \right\|\) is nontrivial, especially if machine arithmetic precision is taken into account. This issue is considered in a large number of studies (see [8] and references therein).

2.1 Algorithms for Symmetric SLAEs

The inception and the first period of development of iterative methods in Krylov subspaces, which may be dated to 1950–1965, are associated with the pioneering works of Lanczos (1950, 1952), Arnoldi (1951), Hestenes and Stiefel (1951, 1952), Craig (1954, 1955), Vorob’ev (1958), Faddeev and Faddeeva (1958, 1960), Fridman (1962), Golub (1965), and Varga (1962). The list of subsequent studies runs into the hundreds.

We begin the consideration with algorithms for solving symmetric SLAEs, with the purpose of giving a uniform view of this intensively developing area of research, and discuss important issues of the computational performance of the methods.

Let the iterative approximation \({{u}^{n}}\) and the corresponding residual vector \({{r}^{n}} = f - A{{u}^{n}}\) be calculated by the formulas

$$\begin{gathered} {{u}^{n}} = {{u}^{{n - 1}}} + {{\alpha }_{{n - 1}}}{{p}^{{n - 1}}} = {{u}^{0}} + {{\alpha }_{0}}{{p}^{0}} + \; \ldots \; + {{\alpha }_{{n - 1}}}{{p}^{{n - 1}}}, \\ {{r}^{n}} = {{r}^{{n - 1}}} - {{\alpha }_{{n - 1}}}A{{p}^{{n - 1}}} = {{r}^{0}} - {{\alpha }_{0}}A{{p}^{0}} - \; \ldots \; - {{\alpha }_{{n - 1}}}A{{p}^{{n - 1}}}, \\ \end{gathered} $$
(4)

where \({{\alpha }_{k}}\) and \({{p}^{k}}\) are iteration parameters and direction vectors. In this paper, we describe the family of conjugate direction algorithms characterized by various variational and projective properties determined by the corresponding scalar products and norms of vectors

$${{(u,{v})}_{\gamma }} = ({{A}^{\gamma }}u,{v}),\quad \left\| u \right\|_{\gamma }^{2} = {{(u,u)}_{\gamma }},$$
(5)

where the exponent is \(\gamma = 0,\;1,\;2\). Let the direction vectors be \({{A}^{\gamma }}\)-orthogonal, i.e.,

$$({{A}^{\gamma }}{{p}^{n}},{{p}^{k}}) = \rho _{k}^{{(\gamma )}}{{\delta }_{{n,k}}},\quad \rho _{k}^{{(\gamma )}} = ({{A}^{\gamma }}{{p}^{k}},{{p}^{k}}) = \left\| {{{p}^{k}}} \right\|_{\gamma }^{2},$$
(6)

where \({{\delta }_{{n,k}}}\) is the Kronecker symbol. Then, the residuals \({{r}^{n}}\) satisfy the relations

$${{\Phi }_{\gamma }}({{r}^{n}}) = ({{A}^{{\gamma - 2}}}{{r}^{n}},{{r}^{n}}) = {{({{r}^{0}},{{r}^{0}})}_{{\gamma - 2}}} - \sum\limits_{k = 0}^{n - 1} {[2{{\alpha }_{k}}({{r}^{0}},{{A}^{{\gamma - 1}}}{{p}^{k}}) - \alpha _{k}^{2}{{\rho }_{k}}]} {\kern 1pt} .$$
(7)

This implies that, at the values of the parameters

$${{\alpha }_{k}} = {{\sigma }_{k}}{\text{/}}{{\rho }_{k}},\quad {{\sigma }_{k}} = {{({{r}^{0}},{{p}^{k}})}_{{\gamma - 1}}} = {{({{r}^{k}},{{p}^{k}})}_{{\gamma - 1}}},$$
(8)

the functionals \({{\Phi }_{\gamma }}\) take the minimum values

$${{\Phi }_{\gamma }}({{r}^{n}}) = \left\| {{{r}^{0}}} \right\|_{{\gamma - 2}}^{2} - \sum\limits_{k = 0}^{n - 1} {{{({{r}^{0}},{{p}^{k}})_{{\gamma - 1}}^{2}} \mathord{\left/ {\vphantom {{({{r}^{0}},{{p}^{k}})_{{\gamma - 1}}^{2}} {{{\rho }_{k}}}}} \right. \kern-0em} {{{\rho }_{k}}}}} .$$
(9)

Here and below, the index \(\gamma \) of the coefficients and vectors is dropped for simplicity, and the matrix \(A\) is assumed to be nonsingular for the time being.

There are two ways to satisfy conditions (6). The first one is to use the Lanczos \({{A}^{\gamma }}\)-orthogonalization, which can be written as

$$\begin{gathered} {{p}^{0}} = 0,\quad {{{\bar {\beta }}}_{1}}{{p}^{1}} = f,\quad k = 1,\;2,\; \ldots ,\;n: \\ \mathop {\bar {\beta }}\nolimits_{n + 1} {{p}^{{n + 1}}} = A{{p}^{n}} - \mathop {\bar {\alpha }}\nolimits_n {{p}^{n}} - \mathop {\bar {\beta }}\nolimits_n {{p}^{{n - 1}}},\quad \mathop {\bar {\alpha }}\nolimits_n = {{({{p}^{n}},{{p}^{n}})}_{{\gamma + 1}}}, \\ \end{gathered} $$
(10)

where \(\mathop {\bar {\beta }}\nolimits_n \) are chosen from the condition \({{\left\| {{{p}^{n}}} \right\|}_{\gamma }} = 1\). Another way (by Hestenes and Stiefel), which is most widespread due to its stability, uses the two-term recurrence

$${{p}^{0}} = {{r}^{0}},\quad {{p}^{n}} = {{r}^{n}} + {{\beta }_{{n - 1}}}{{p}^{{n - 1}}},\quad {{\beta }_{{n - 1}}} = ({{A}^{\gamma }}{{r}^{n}},{{p}^{{n - 1}}}){\text{/}}{{\rho }_{{n - 1}}},$$
(11)

which, in combination with the formula for the residual in (4), yields other relations for the direction vectors:

$${{p}^{n}} = (1 + {{\beta }_{{n - 1}}}){{p}^{{n - 1}}} - {{\alpha }_{{n - 1}}}A{{p}^{{n - 1}}} - {{\beta }_{{n - 2}}}{{p}^{{n - 2}}}.$$
(12)

Using (4) and (11), we can obtain the property of \({{A}^{{\gamma - 1}}}\)-orthogonality of residual vectors and more convenient (for \(\gamma = 1,\;2)\) formulas for the iteration coefficients

$$({{A}^{{\gamma - 1}}}{{r}^{n}},{{r}^{k}}) = {{\sigma }_{k}}{{\delta }_{{n,k}}},\quad {{\sigma }_{k}} = \left\| {{{r}^{k}}} \right\|_{{\gamma - 1}}^{2},\quad {{\beta }_{k}} = {{\sigma }_{{k + 1}}}{\text{/}}{{\sigma }_{k}}.$$
(13)

However, \({{\alpha }_{k}}\) in the case \(\gamma = 0\) must be calculated in a different way. Since the exact solution of the SLAE can be represented as the expansion in the basis

$$u = {{u}^{0}} + {{\alpha }_{0}}{{p}^{0}} + \; \ldots \; + {{\alpha }_{{m - 1}}}{{p}^{{m - 1}}},\quad m \leqslant N,$$

the vector error of the iterative approximation and the corresponding residual are written as

$$\begin{gathered} {{{v}}^{n}} = u - {{u}^{n}} = {{\alpha }_{n}}{{p}^{n}} + \; \ldots \; + {{\alpha }_{{m - 1}}}{{p}^{{m - 1}}}, \\ {{r}^{n}} = A{{{v}}^{n}} = {{\alpha }_{n}}A{{p}^{n}} + \; \ldots \; + {{\alpha }_{{m - 1}}}A{{p}^{{m - 1}}}. \\ \end{gathered} $$
(14)

Hence, using the \({{A}^{\gamma }}\)-orthogonalization of the vectors \({{p}^{k}}\), we have

$${{\alpha }_{n}} = {{({{{v}}^{n}},{{p}^{n}})}_{\gamma }}{\text{/}}\left\| {{{p}^{n}}} \right\|_{\gamma }^{2} = - {{\alpha }_{{n - 1}}}({{{v}}^{n}},A{{p}^{{n - 1}}}){\text{/}}\left\| {{{p}^{n}}} \right\|_{\gamma }^{2} = - {{\alpha }_{{n - 1}}}{{({{r}^{n}},{{p}^{{n - 1}}})}_{\gamma }}{\text{/}}\left\| {{{p}^{n}}} \right\|_{\gamma }^{2}.$$
(15)

Here we used the orthogonality of \({{p}^{n}}\) to the vectors \({{p}^{{n - 1}}}\) and \({{p}^{{n - 2}}}\) and its relation to the vector \(A{{p}^{{n - 1}}}\) in (12). The coefficient \({{\beta }_{{n - 1}}}\) must be calculated by formula (11).

The iterative processes of interest possess optimal properties thus ensuring the minimization of the corresponding functionals \({{\Phi }_{\gamma }}({{r}^{n}})\) of form (7) in Krylov subspaces

$${{\mathcal{K}}_{n}}({{r}^{0}},A) = {\text{Span}}({{r}^{0}},A{{r}^{0}},\; \ldots ,\;{{A}^{{n - 1}}}{{r}^{0}}).$$
(16)

In the cases \(\gamma = 0,\;1,\;2\), these algorithms are called the minimum iteration (or minimum error) method (see [8]), the conjugate gradient method, and the conjugate residual method, respectively [5, 9, 27].
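
To make the recurrences (4), (8), (11), and (13) concrete, the following is a minimal Python/NumPy sketch of the unpreconditioned conjugate gradient (\(\gamma = 1\)) and conjugate residual (\(\gamma = 2\)) iterations for an s.p.d. matrix; the function and variable names are illustrative and do not come from any of the cited libraries.

```python
import numpy as np

def conjugate_directions(A, f, gamma=1, tol=1e-10, maxiter=1000):
    """Conjugate gradient (gamma=1) or conjugate residual (gamma=2) iteration
    for a dense s.p.d. matrix A, following formulas (4), (8), (11), (13)."""
    u = np.zeros_like(f)
    r = f - A @ u                                  # initial residual r^0
    p = r.copy()                                   # initial direction vector p^0
    sigma = r @ r if gamma == 1 else r @ (A @ r)   # sigma = (A^{gamma-1} r, r)
    for n in range(maxiter):
        Ap = A @ p
        rho = p @ Ap if gamma == 1 else Ap @ Ap    # rho = (A^{gamma} p, p)
        alpha = sigma / rho
        u += alpha * p                             # new iterate, formula (4)
        r -= alpha * Ap                            # new residual, formula (4)
        if np.linalg.norm(r) <= tol * np.linalg.norm(f):
            break
        sigma_new = r @ r if gamma == 1 else r @ (A @ r)
        beta = sigma_new / sigma                   # beta_k = sigma_{k+1}/sigma_k, formula (13)
        p = r + beta * p                           # two-term recurrence (11)
        sigma = sigma_new
    return u
```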

The approaches described above admit a simple generalization to SLAEs preconditioned using some s.p.d. matrix \(B\). To preserve the symmetry of the systems, this should be done by two-sided preconditioning with the help of the formally introduced matrix \({{B}^{{1/2}}}\). As a result, system (1) takes the form

$$\bar {A}\bar {u} = \bar {f},\quad \bar {A} = {{B}^{{ - 1/2}}}A{{B}^{{ - 1/2}}},\quad \bar {u} = {{B}^{{1/2}}}u,\quad \bar {f} = {{B}^{{ - 1/2}}}f.$$
(17)

As a result of applying the conjugate direction formulas to SLAE (17), we obtain, after certain transformations, the following iterative process for the case \(\gamma = 1,2\):

$$\mathop {\hat {p}}\nolimits^0 = {{\hat {r}}^{0}} = {{B}^{{ - 1}}}{{r}^{0}} = {{B}^{{ - 1}}}(f - A{{u}^{0}}),\quad n = 1,\;2,\;...,$$
$$\begin{gathered} {{u}^{n}} = {{u}^{{n - 1}}} + {{\alpha }_{{n - 1}}}\mathop {\hat {p}}\nolimits^{n - 1} ,\quad \mathop {\hat {r}}\nolimits^n = \mathop {\hat {r}}\nolimits^{n - 1} - {{\alpha }_{{n - 1}}}A\mathop {\hat {p}}\nolimits^{n - 1} , \\ \mathop {\hat {p}}\nolimits^n = {{{\hat {r}}}^{n}} + {{\beta }_{{n - 1}}}\mathop {\hat {p}}\nolimits^{n - 1} ,\quad {{\alpha }_{{n - 1}}} = {{\sigma }_{{n - 1}}}{\text{/}}{{\rho }_{{n - 1}}},\quad {{\beta }_{{n - 1}}} = {{\sigma }_{n}}{\text{/}}{{\sigma }_{{n - 1}}}, \\ \end{gathered} $$
(18)
$${{\sigma }_{{n - 1}}} = ({{A}^{{\gamma - 1}}}\mathop {\hat {r}}\nolimits^{n - 1} ,{{\hat {r}}^{{n - 1}}}),\quad {{\rho }_{{n - 1}}} = ({{B}^{{ - 1}}}A\mathop {\hat {p}}\nolimits^{n - 1} ,{{A}^{{\gamma - 1}}}\mathop {\hat {p}}\nolimits^{n - 1} ),$$

where the new vectors are related to the old ones by the relations \(\mathop {\hat {p}}\nolimits^n = {{B}^{{ - 1}}}{{p}^{n}}\), \({{\hat {r}}^{n}} = {{B}^{{ - 1}}}{{r}^{n}}\). For the minimum iteration method with \(\gamma = 0\), the parameters \({{\alpha }_{{n - 1}}}{\text{ and }}{{\beta }_{{n - 1}}}\) should be calculated by formulas (15) and (11) in which \({{r}^{k}}{\text{ and }}{{p}^{k}}\) are replaced by \({{\bar {r}}^{k}} = {{B}^{{ - 1}}}{{r}^{k}}\) and \(\mathop {\bar {p}}\nolimits^k = {{B}^{{ - 1}}}{{p}^{k}}\), respectively.

Remark 1. In practice, it is often required to solve a series of SLAEs with the same matrix but successively computed different right-hand sides. In this case, a significant amount of computation can be avoided by storing the calculated vectors \({{p}^{k}}\) and \(A{{p}^{k}}\) after solving the first system and using them in one way or another to expand the Krylov basis when solving the subsequent SLAEs. These issues are discussed in a large number of papers (see [28] and the list of 157 works therein). However, the significant improvement of the “informativeness” of the basis of iterative processes requires further study.

Remark 2. In book [8], the following methodological circumstance was mentioned: Methods in Krylov subspaces are closely related to the problem of moments, which plays an important role in the theory of operators and many applications. The beginnings of this topic were laid by Chebyshev, Markov, and Stieltjes.

In many applications, algebraic systems with symmetric matrices that have a sign-definite spectrum are of interest. Such SLAEs can be singular and, in particular, consistent or inconsistent. Inconsistent systems have a generalized, rather than classical, solution in the sense of least squares. For these cases, the normal (or pseudoinverse) solution is defined as the solution that has the minimum norm and minimizes the residual:

$$\min{{\left\| u \right\|}_{2}}:u = {\text{argmin}}{{\left\| {f - Au} \right\|}_{2}}.$$
(19)

Such a solution always exists and is unique, and its formal representation is obtained using the left Gauss transformation of the original system \(({{A}^{{\text{T}}}}Au = {{A}^{{\text{T}}}}f)\); this representation is

$${{u}^{ + }} = {{({{A}^{{\text{T}}}}A)}^{ + }}{{A}^{{\text{T}}}}f = {{({{A}^{2}})}^{ + }}Af,$$
(20)

where \({{A}^{ + }}\) is the pseudoinverse or generalized inverse matrix of \(A\).

For solving such SLAEs, Saunders and his colleagues (see the review [29]) developed the methods SYMMLQ, MINRES, and MINRES–QLP based on the Lanczos \(A\)-orthogonalization, which, due to recurrences (10), can be written in matrix form

$$A{{P}_{k}} = {{P}_{{k + 1}}}{{T}_{k}},\quad {{T}_{k}} = \left[ {\begin{array}{*{20}{c}} {\mathop {\bar {\alpha }}\nolimits_1 }&{\mathop {\bar {\beta }}\nolimits_1 }&{}&{} \\ {\mathop {\bar {\beta }}\nolimits_2 }&{\mathop {\bar {\alpha }}\nolimits_2 }&{}&{} \\ {}& \ddots & \ddots &{\mathop {\bar {\beta }}\nolimits_k } \\ {}&{}&{\mathop {\bar {\beta }}\nolimits_k }&{\mathop {\bar {\alpha }}\nolimits_k } \\ {}&{}&{}&{\mathop {\bar {\beta }}\nolimits_{k + 1} } \end{array}} \right],$$
(21)

where \({{P}_{k}} = [{{p}^{0}},{{p}^{1}},\; \ldots ,\;{{p}^{{k - 1}}}]\). Then, the approximate solution is sought in the form \({{u}^{k}} = {{P}_{k}}{{q}^{k}}\), and the problem reduces to finding a vector \({{q}^{k}} \in {{\mathcal{R}}^{k}}\) that minimizes the residual

$${{r}^{k}} = f - A{{P}_{k}}{{q}^{k}} = {{\beta }_{1}}{{p}^{1}} - {{P}_{{k + 1}}}{{T}_{k}}{{q}^{k}}.$$
(22)

This problem is solved using the LQ- or QR-transformations of the tridiagonal matrix \({{T}_{k}}\) with the orthogonal matrix \(Q\) on the basis of stable Householder reflection operations. The results of representative numerical experiments presented by the authors demonstrate good efficiency of the algorithms, and their software implementations are available on the Internet.
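
As a usage illustration (not tied to the particular experiments reported by the authors), the snippet below applies SciPy's implementation of MINRES to a small consistent singular symmetric system; the matrix is an arbitrary illustrative example.

```python
import numpy as np
from scipy.sparse.linalg import minres

# Symmetric positive semidefinite (singular) test matrix: the graph Laplacian of a path.
A = np.array([[ 1., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  1.]])
f = np.array([1., 0., -1.])        # consistent right-hand side (orthogonal to ker A)

u, info = minres(A, f)             # info == 0 indicates successful convergence
print(info, np.linalg.norm(f - A @ u))
```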

The convergence rate of iterations in Krylov-type methods for positive semidefinite SLAEs is characterized by the bound on the number of iterations \(n(\varepsilon )\) required for suppressing the initial error by a factor of \({{\varepsilon }^{{ - 1}}}\):

$$n(\varepsilon ) \leqslant \frac{1}{2}\sqrt \kappa \ln\frac{2}{\varepsilon },\quad \kappa = {{\lambda }_{{\max}}}{\text{/}}{{\lambda }_{{\min}}};$$

here \(\kappa \) is generally the effective condition number, which for singular matrices is represented in terms of the minimum nonzero eigenvalue \({{\lambda }_{{\min}}}\).

For symmetric SLAEs, the problem of stable numerical solution was basically solved in the 20th century, including the case of sign indefinite, singular, and inconsistent systems. As for the classical conjugate gradient method, we want to mention Kaporin’s ingenious theory of convergence [30] based on the condition \(K\)-number introduced by him; this number is expressed in terms of the trace and determinant of the matrix.

More specifically, consider the preconditioned conjugate gradient method in the following form:

$${{r}^{0}} = f - A{{u}^{0}},\quad {{p}^{0}} = {{B}^{{ - 1}}}{{r}^{0}},\quad n = 0,\;1,\; \ldots ,$$
$${{u}^{{n + 1}}} = {{u}^{n}} + {{\alpha }_{n}}{{p}^{n}},\quad {{r}^{{n + 1}}} = {{r}^{n}} - {{\alpha }_{n}}A{{p}^{n}},$$
$${{p}^{{n + 1}}} = {{B}^{{ - 1}}}{{r}^{{n + 1}}} + {{\beta }_{n}}{{p}^{n}},\quad {{\alpha }_{n}} = {{\sigma }_{n}}{\text{/}}{{\rho }_{n}},$$
$${{\beta }_{n}} = {{\sigma }_{{n + 1}}}{\text{/}}{{\sigma }_{n}},\quad {{\sigma }_{n}} = ({{r}^{n}},{{B}^{{ - 1}}}{{r}^{n}}),\quad {{\rho }_{n}} = ({{p}^{n}},A{{p}^{n}}).$$
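
The following minimal Python/NumPy sketch transcribes these formulas directly; here apply_Binv stands for the action of \({{B}^{{ - 1}}}\) on a vector and is a placeholder for whatever preconditioner is chosen, and all names are illustrative.

```python
import numpy as np

def preconditioned_cg(A, f, apply_Binv, tol=1e-10, maxiter=1000):
    """Preconditioned conjugate gradient method in the form written above."""
    u = np.zeros_like(f)
    r = f - A @ u                        # r^0
    z = apply_Binv(r)                    # B^{-1} r^0
    p = z.copy()                         # p^0 = B^{-1} r^0
    sigma = r @ z                        # sigma_0 = (r^0, B^{-1} r^0)
    for n in range(maxiter):
        Ap = A @ p
        alpha = sigma / (p @ Ap)         # rho_n = (p^n, A p^n)
        u += alpha * p
        r -= alpha * Ap
        z = apply_Binv(r)
        sigma_new = r @ z
        if np.sqrt(abs(sigma_new)) <= tol:
            break
        beta = sigma_new / sigma         # beta_n = sigma_{n+1}/sigma_n
        p = z + beta * p
        sigma = sigma_new
    return u

# Example with a diagonal (Jacobi) preconditioner:
#   u = preconditioned_cg(A, f, lambda v: v / np.diag(A))
```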

For this preconditioned conjugate gradient method, it was proved in [31] that the condition

$$({{r}^{n}},{{B}^{{ - 1}}}{{r}^{n}}) \leqslant {{\varepsilon }^{2}}({{r}^{0}},{{B}^{{ - 1}}}{{r}^{0}})$$

is fulfilled if the number of iterations is

$$n(\varepsilon ) = {{\log}_{2}}K + {{\log}_{2}}{{\varepsilon }^{{ - 1}}},$$

where \(K\) is the condition K-number of the preconditioned matrix \({{B}^{{ - 1}}}A\) defined by

$$K = K({{B}^{{ - 1}}}A) = {{{{{\left[ {\frac{1}{N}\operatorname{trace} ({{B}^{{ - 1}}}A)} \right]}}^{N}}} \mathord{\left/ {\vphantom {{{{{\left[ {\frac{1}{N}\operatorname{trace} ({{B}^{{ - 1}}}A)} \right]}}^{N}}} {\det ({{B}^{{ - 1}}}A)}}} \right. \kern-0em} {\det ({{B}^{{ - 1}}}A)}}.$$

In [32], iterative algorithms were optimized for a number of preconditioners so as to minimize the condition K-number.
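
As a numerical illustration of this definition, the K-condition number of a small dense preconditioned matrix can be evaluated directly from the trace and determinant; such a brute-force computation is, of course, feasible only for small test matrices, and the matrices below are arbitrary examples.

```python
import numpy as np

def kaporin_k_number(A, B):
    """K(B^{-1} A) = [trace(B^{-1} A)/N]^N / det(B^{-1} A) for small dense matrices."""
    N = A.shape[0]
    C = np.linalg.solve(B, A)                # B^{-1} A
    return (np.trace(C) / N) ** N / np.linalg.det(C)

A = np.diag([1.0, 2.0, 4.0])
print(kaporin_k_number(A, np.eye(3)))        # unpreconditioned matrix: K > 1
print(kaporin_k_number(A, A))                # exact preconditioner B = A: K = 1
```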

2.2 Krylov Algorithms for Asymmetric SLAEs

We continue the description of specific approaches with a wide class of multipreconditioned semiconjugate direction (SCD) algorithms [31]. In the general block form, such iterative methods in Krylov subspaces are written as

$${{r}^{0}} = f - A{{u}^{0}},\quad n = 0,\; \ldots :\quad {{u}^{{n + 1}}} = {{u}^{n}} + {{P}_{n}}{{\bar {\alpha }}_{n}},$$
$${{r}^{{n + 1}}} = {{r}^{n}} - A{{P}_{n}}{{\bar {\alpha }}_{n}} = {{r}^{q}} - A{{P}_{q}}{{\bar {\alpha }}_{q}} - \; \cdots \; - A{{P}_{n}}{{\bar {\alpha }}_{n}},\quad 0 \leqslant q \leqslant n,$$
(23)
$${{P}_{n}} = (p_{1}^{n},\; \ldots ,\;p_{{{{M}_{n}}}}^{n}) \in {{\mathcal{R}}^{{N,{{M}_{n}}}}},\quad {{\bar {\alpha }}_{n}} = {{({{\alpha }_{{n,1}}},\; \ldots ,\;{{\alpha }_{{n,{{M}_{n}}}}})}^{{\text{T}}}} \in {{\mathcal{R}}^{{{{M}_{n}}}}}.$$

Here \(p_{1}^{n},\; \ldots ,\;p_{{{{M}_{n}}}}^{n}\) are the direction vectors constituting at the \(n\)th iteration the matrix \({{P}_{n}}\), and \({{\bar {\alpha }}_{n}}\) is the vector of iteration parameters. For the vectors \(p_{k}^{n}\) in (23), we assume for the time being that only the orthogonality conditions

$$\begin{gathered} (Ap_{k}^{n},{{A}^{\gamma }}p_{{k'}}^{{n'}}) = \rho _{{n,k}}^{{(\gamma )}}\delta _{{n,n'}}^{{k,k'}},\quad \rho _{{n,k}}^{{(\gamma )}} = (Ap_{k}^{n},{{A}^{\gamma }}p_{k}^{n}), \\ \gamma = 0,\;1,\quad n' = 0,\;1,\; \ldots ,\;n - 1,\quad k,k' = 1,\;2,\; \ldots ,\;{{M}_{n}} \\ \end{gathered} $$
(24)

are fulfilled. If we define the coefficients \({{\bar {\alpha }}_{n}} = {\text{\{}}{{\alpha }_{{n,l}}}{\text{\}}}\) by the formulas

$${{\alpha }_{{n,l}}} = {{\sigma }_{{n,l}}}{\text{/}}\rho _{{n,n}}^{{(\gamma )}},\quad {{\sigma }_{{n,l}}} = ({{r}^{0}},{{A}^{\gamma }}\bar {p}_{l}^{n}),$$
(25)

then (23) implies the following expressions for the residual functionals:

$$\Phi _{n}^{{(\gamma )}}({{r}^{{n + 1}}}) \equiv ({{r}^{{n + 1}}},{{A}^{{\gamma - 1}}}{{r}^{{n + 1}}}) = ({{r}^{q}},{{A}^{{\gamma - 1}}}{{r}^{q}}) - \sum\limits_{k = q}^n {\sum\limits_{l = 1}^{{{M}_{n}}} {{{{{{({{r}^{q}},{{A}^{\gamma }}p_{l}^{k})}}^{2}}} \mathord{\left/ {\vphantom {{{{{({{r}^{q}},{{A}^{\gamma }}p_{l}^{k})}}^{2}}} {\rho _{{k,l}}^{{(\gamma )}}}}} \right. \kern-0em} {\rho _{{k,l}}^{{(\gamma )}}}}} } ,\quad q = 0,\;1,\; \ldots ,\;n;$$
(26)

they attain their minimums in the block Krylov subspaces

$${{K}_{M}} = \operatorname{Span} \{ p_{1}^{0},\; \ldots ,\;p_{{{{M}_{0}}}}^{0},Ap_{1}^{1},\; \ldots ,\;Ap_{{{{M}_{1}}}}^{1},\; \ldots ,\;Ap_{1}^{n},\; \ldots ,\;Ap_{{{{M}_{n}}}}^{n}\} ,\quad M = {{M}_{0}} + {{M}_{1}} + \; \cdots \; + {{M}_{n}},$$
(27)

at \(\gamma = 1\), and in the case when the matrix \(A\) is symmetric also at \(\gamma = 0\). Note that in the case of asymmetric matrices and \(\gamma = 0\), the application of the algorithms is actually restricted to SLAEs with matrices that have a positive definite symmetric part \({{A}_{s}} = 0.5(A + {{A}^{{\text{T}}}})\) because, e.g., for skew-symmetric matrices, \((Au,u) = 0\) for any \(u\).

The orthogonality property of the direction vectors (24) can be ensured if they are defined using “multipreconditioned” recurrences in which each vector \(p_{l}^{{n + 1}}\) is assigned a specific preconditioning matrix \({{B}_{{n + 1,l}}}\):

$$p_{l}^{0} = B_{{0,l}}^{{ - 1}}{{r}^{0}},\quad p_{l}^{{n + 1}} = B_{{n + 1,l}}^{{ - 1}}{{r}^{{n + 1}}} - \sum\limits_{k = 0}^n {\sum\limits_{l = 1}^{{{M}_{k}}} {\beta _{{n,k,l}}^{{(\gamma )}}} } p_{l}^{k},\quad n = 0,\;1,\; \ldots ,$$
$$\begin{gathered} {{B}_{{n,l}}} \in {{\mathcal{R}}^{{N,N}}},\quad l = 1,\; \ldots ,\;{{M}_{n}};\quad \gamma = 0,\;1,\;2, \\ \bar {\beta }_{{n,k}}^{{(\gamma )}} = \{ \beta _{{n,k,l}}^{{(\gamma )}}\} = {{(\beta _{{n,k,1}}^{{(\gamma )}}\; \ldots \;\beta _{{n,k,{{M}_{n}}}}^{{(\gamma )}})}^{{\text{T}}}} \in {{\mathcal{R}}^{{{{M}_{n}}}}}, \\ \end{gathered} $$
(28)
$$\beta _{{n,k,l}}^{{(\gamma )}} = - {{({{A}^{\gamma }}p_{l}^{k},AB_{{n + 1,l}}^{{ - 1}}{{r}^{{n + 1}}})} \mathord{\left/ {\vphantom {{({{A}^{\gamma }}p_{l}^{k},AB_{{n + 1,l}}^{{ - 1}}{{r}^{{n + 1}}})} {\rho _{{k,l}}^{{(\gamma )}}}}} \right. \kern-0em} {\rho _{{k,l}}^{{(\gamma )}}}},\quad n = 0,\;1,\; \ldots ,\quad k = 0,\;1,\; \ldots ,\;n,\quad l = 1,\;2,\; \ldots ,\;{{M}_{n}}.$$

If the matrix \(A\) is symmetric, then the long recurrences reduce to two-term ones, and we obtain the conjugate gradient and conjugate residual methods (for \(\gamma = 0,\;1\), respectively, in multipreconditioned and classical versions). For asymmetric SLAEs, these algorithms are called semiconjugate gradient and semiconjugate residual (SCG and SCR) methods. In block (multipreconditioned) conjugate direction methods for solving symmetric SLAEs, formulas (23)–(25) remain the same, and the direction matrices \({{P}_{n}}\) are calculated using the two-term recurrences

$${{P}_{{n + 1}}} = {{Q}_{{n + 1}}} + {{P}_{n}}\bar {\beta }_{n}^{{(\gamma )}},\quad {{Q}_{{n + 1}}} = \{ q_{l}^{{n + 1}} = B_{{n + 1,l}}^{{ - 1}}{{r}^{{n + 1}}}\} \in {{\mathcal{R}}^{{N,{{M}_{{n + 1}}}}}},$$
$$\bar {\beta }_{n}^{{(\gamma )}} = \{ \beta _{{n,l}}^{{(\gamma )}} = \beta _{{n,n,l}}^{{(\gamma )}}\} \in {{\mathcal{R}}^{{{{M}_{n}}}}}.$$

These formulas, as well as (28), contain “built-in” preconditioning matrices. If we set \({{B}_{{n + 1,l}}} = I\) and \({{M}_{n}} = 1\) for all \(n\), then we obtain the Krylov process in its “pure form” without preconditioning. Note that formulas (28) implement the Gram–Schmidt orthogonalization algorithm, the stability of which can be improved by using the modified Gram–Schmidt (MGS) orthogonalization method [5, 33].
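
Since the stability gain of the modified Gram–Schmidt variant lies in subtracting the projections one at a time from the already updated vector, the following minimal sketch shows this update for orthogonalizing a new direction vector against the stored ones; the Euclidean inner product is used here purely for brevity in place of the \({{A}^{\gamma }}\)-weighted ones, and all names are illustrative.

```python
import numpy as np

def mgs_orthogonalize(q, basis, inner=np.dot):
    """Modified Gram-Schmidt: orthogonalize q against the stored direction
    vectors in `basis` with respect to the chosen inner product."""
    q = q.copy()
    for p in basis:
        # Project out the p-component of the *current* q (this sequential update
        # is what distinguishes MGS from the classical Gram-Schmidt formula).
        q -= (inner(p, q) / inner(p, p)) * p
    return q
```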

A feature of the algorithms considered here, when applied to ill-conditioned asymmetric SLAEs, is their high demand for resources (both computation and memory) when the number of iterations is large. A means for mitigating this drawback is to reduce the number of stored and used direction vectors, which can be achieved using two approaches. The first one is to truncate the recurrence by taking into account only its last \(m\) vectors. The second approach is to periodically restart the process after each \(m\) iterations, i.e., to calculate the residual vector from the original equation, as at the zero iteration, rather than from the recurrence formula:

$${{r}^{{{{n}_{t}}}}} = f - A{{u}^{{{{n}_{t}}}}},\quad {{n}_{t}} = mt,\quad t = 0,\;1,\; \ldots ;$$
(29)

here \(t\) is the restart index (the further computations up to \(n = {{n}_{{t + 1}}}\) are performed using recurrence (23)). Both approaches significantly slow down the iterative process.

To eliminate this stagnation effect, it was proposed to add a second level of iterations using the least squares method (LSM) [34]. Suppose that we know the restart approximations \({{u}^{{{{n}_{0}}}}},{{u}^{{{{n}_{1}}}}},\; \ldots ,\;{{u}^{{{{n}_{t}}}}}\), \({{n}_{0}} = 0\). Then, to correct the iteration vector \({{u}^{{{{n}_{t}}}}}\), which is the initial one for the next restart period of method (23)–(29), we use the linear combination

$$\begin{gathered} {{{\hat {u}}}^{{{{n}_{t}}}}} = {{u}^{{{{n}_{t}}}}} + {{b}_{1}}{{{v}}_{1}} + \; \ldots \; + {{b}_{t}}{{{v}}_{t}} = {{u}^{{{{n}_{t}}}}} + {{{v}}^{{{{n}_{t}}}}},\quad {{{v}}^{{{{n}_{t}}}}} = {{V}_{t}}\bar {b},\quad \bar {b} = {{({{b}_{1}},\; \ldots ,\;{{b}_{t}})}^{{\text{T}}}}, \\ {{V}_{t}} = \{ {{{v}}_{k}} = {{u}^{{{{n}_{k}}}}} - {{u}^{{{{n}_{{k - 1}}}}}},\;k = 1,\; \ldots ,\;t\} \in {{\mathcal{R}}^{{N,t}}}, \\ \end{gathered} $$
(30)

in which the vector of coefficients \(\bar {b}\) is found by minimizing the norm \(\left\| {{{r}^{{{{n}_{t}}}}}} \right\|\) of the residual, i.e., by solving the overdetermined algebraic system

$${{W}_{t}}\bar {b} = {{r}^{{{{n}_{t}}}}} \equiv f - A{{u}^{{{{n}_{t}}}}},\quad {{W}_{t}} = A{{V}_{t}}.$$
(31)

This SLAE can be solved, e.g., using the \(QR\)- or \(SVD\)-decompositions of the matrix \({{W}_{t}}\). The normal solution with the minimum norm \(\left\| {\bar {b}} \right\|\) is found by applying to (31) the left Gauss transformation

$$W_{t}^{{\text{T}}}{{W}_{t}}\bar {b} = {{W}_{t}}{{r}^{{{{n}_{t}}}}}.$$

A simpler SLAE in the sense of its decreased condition number is obtained by multiplying system (31) by the matrix \(V_{t}^{{\text{T}}}\) on the left:

$${{C}_{t}}\bar {b} \equiv V_{t}^{{\text{T}}}A{{V}_{t}}\bar {b} = V_{t}^{{\text{T}}}{{r}^{{{{n}_{t}}}}}.$$
(32)

If \({{V}_{t}}\) is a full-rank matrix, then the matrices \(A\) and \({{C}_{t}}\) are simultaneously nonsingular. In this case, we have for the correction vector in (30) the formula \({{{v}}^{{{{n}_{t}}}}} = {{B}_{t}}{{r}^{{{{n}_{t}}}}} \equiv {{V}_{t}}{{(V_{t}^{{\text{T}}}A{{V}_{t}})}^{{ - 1}}}V_{t}^{{\text{T}}}{{r}^{{{{n}_{t}}}}}\), where \({{B}_{t}} = {{V}_{t}}{{\hat {A}}^{{ - 1}}}V_{t}^{{\text{T}}}\) (\(\hat {A} = V_{t}^{{\text{T}}}A{{V}_{t}}\)) is a low-rank approximation of the matrix \({{A}^{{ - 1}}}\). In this approach, all restart vectors are stored in corrected form, and the corresponding residuals are found by the formula \({{r}^{{{{n}_{t}}}}} = f - A{{u}^{{{{n}_{t}}}}}\). If a matrix under consideration is not invertible, then a generalized inverse matrix is used. Numerous experiments with the use of the LSM for accelerating Krylov processes with restarts show its high efficiency.
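
A minimal Python/NumPy sketch of the correction (30), (31): the coefficient vector \(\bar {b}\) is obtained from the overdetermined system by ordinary least squares (here via numpy.linalg.lstsq); all names are illustrative, and at least one completed restart period (\(t \geqslant 1\)) is assumed.

```python
import numpy as np

def lsm_restart_correction(A, f, restart_iterates):
    """Correct the latest restart approximation u^{n_t} by the combination (30),
    with the coefficients b found from the least squares problem (31).
    `restart_iterates` is the list [u^{n_0}, u^{n_1}, ..., u^{n_t}], t >= 1."""
    u_t = restart_iterates[-1]
    r_t = f - A @ u_t                                   # current restart residual r^{n_t}
    # Columns of V_t: differences of successive restart approximations.
    V = np.column_stack([restart_iterates[k] - restart_iterates[k - 1]
                         for k in range(1, len(restart_iterates))])
    W = A @ V                                           # W_t = A V_t
    b, *_ = np.linalg.lstsq(W, r_t, rcond=None)         # minimize ||W_t b - r^{n_t}||
    return u_t + V @ b                                  # corrected approximation (30)
```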

One more possibility of improving the performance of SCD methods with restarts is to store all the direction vectors \({{p}^{n}}\) and the vectors \(A{{p}^{n}}\) obtained during the first restart period and use them in the computations in the subsequent restart periods rather than calculate new vectors \({{p}^{n}}\) and \(A{{p}^{n}}\).

The class of semiconjugate direction methods with dynamic multipreconditioning is equivalent, in terms of convergence rate, to other known algorithms for solving asymmetric SLAEs in Krylov subspaces, among which the most popular is the generalized minimal residual method (GMRES) based on the Arnoldi orthogonalization, together with its various versions. This algorithm was proposed by Saad and Schultz in 1986 [5], and it has received wide and well-deserved popularity. Among numerous studies of this method, we distinguish the results by Knizhnerman [35] on estimating the error of the Arnoldi orthogonalization.

Another principle of constructing Krylov iterative processes for solving asymmetric SLAEs is based on constructing sequences of biorthogonal vectors. In this case, the direction vectors are calculated by short (two-term) recurrences, but two matrix–vector products are required at each iteration step. The biorthogonalization ideas underlie the algorithms BiCG, CGS, and BiCGStab and their various modifications; later, their analogs with \(A\)-biorthogonalization of direction vectors were invented (see the review in [36]). The closely related family of induced dimension reduction (IDR) methods (see [37] and references therein) is based on Sonneveld subspaces. Among other approaches, we mention the quasi-minimal residual methods and the least squares methods LSQR and LSMR [38]. In recent years, algorithms based on the bidiagonal Craig–Golub–Kahan transformations (originally proposed in 1955 and 1965) have also revived; their informative study, generalization, and comparative analysis can be found in [25, 39]. Finally, we note the flexible conjugate gradient (FCG) methods [40], which are a generalization of the classical CG methods (including preconditioned ones) to the asymmetric case.
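
Several of the methods named in this subsection are available in standard libraries; as a pure usage illustration (saying nothing about their relative merits), the snippet below calls SciPy's gmres, bicgstab, and lsmr on a small nonsymmetric system.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import gmres, bicgstab, lsmr

A = csr_matrix(np.array([[4., 1., 0.],
                         [2., 5., 1.],
                         [0., 3., 6.]]))
f = np.array([1., 2., 3.])

x_gmres, info_g = gmres(A, f, restart=2)   # restarted GMRES(m) with m = 2
x_bicgstab, info_b = bicgstab(A, f)        # biorthogonalization-based solver
x_lsmr = lsmr(A, f)[0]                     # least squares solver (also handles rectangular A)
for x in (x_gmres, x_bicgstab, x_lsmr):
    print(np.linalg.norm(f - A @ x))
```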

3 METHODS OF SLAE PRECONDITIONING

In this section, in describing the algorithms, we discuss at a qualitative level the issues of their potential parallelization. A more detailed discussion of the technological aspects determining the performance of parallel implementations is given later.

In addition to the use of a preconditioning matrix as in (17) and (18), preconditioning can also be applied directly (left, right, or two-sided) to the original SLAE:

$$\bar {A}u = \bar {f},\quad \bar {A} = {{B}^{{ - 1}}}A,\quad \bar {f} = {{B}^{{ - 1}}}f,$$
$$\hat {A}\hat {u} = A{{B}^{{ - 1}}}Bu = f,\quad \hat {A} = A{{B}^{{ - 1}}},\quad \hat {u} = Bu,$$
(33)
$${\check{A}}\,{\check{u}}\,=B_{1}^{-1}AB_{2}^{-1}{{B}_{2}}u=B_{1}^{-1}f={\check{f}}\,,\quad {\check{A}}\,=B_{1}^{-1}AB_{2}^{-1},\quad {\check{u}}\,={{B}_{2}}u,$$

where \(B\), \({{B}_{1}}\), and \({{B}_{2}}\) are nonsingular matrices. Note that if \(A\) is symmetric, then to keep the matrix \({\check{A}}\,\) symmetric, one should set \({{B}_{1}} = B_{2}^{{\text{T}}}\) in the last case. Since the matrices \(\bar {A}\), \(\hat {A}\), or \({\check{A}}\,\) of the preconditioned algebraic systems are better conditioned, these systems should be solved using the Krylov iteration formulas in pure form without additional preconditioning (although formally this is not prohibited, and multipreconditioning may be applied using a modification of both the iterative process (28) and the SLAE itself according to (33)).

Among the simple preconditioning methods, we mention scaling and binormalization of matrices. In the first method, a congruent diagonal transformation (\(\tilde {A} = DAD\)) is performed to obtain a unit principal diagonal; the second method is based on equalizing the norms of the rows and columns of the matrix (see the review in [41, 42]). Such approaches may be used as preliminary ones in combination with other preconditioners.
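
For instance, the congruent diagonal scaling mentioned above can be sketched as follows (the diagonal entries of \(A\) are assumed positive, and the names are illustrative):

```python
import numpy as np

def symmetric_diagonal_scaling(A, f):
    """Congruent scaling A~ = D A D, f~ = D f, with D = diag(a_ii^{-1/2}),
    so that the scaled matrix has a unit principal diagonal."""
    d = 1.0 / np.sqrt(np.diag(A))            # assumes positive diagonal entries
    A_scaled = A * d[:, None] * d[None, :]   # (D A D)_{ij} = d_i a_ij d_j
    f_scaled = d * f
    return A_scaled, f_scaled, d

# After solving A_scaled @ v = f_scaled, the original unknown is recovered as u = d * v.
```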

For the block structure of SLAE (2), it seems the most natural to solve it using the Jacobi block method, which in a slightly generalized form is written as

$${{B}_{{q,q}}}u_{q}^{{n + 1}} \equiv ({{A}_{{q,q}}} + \theta {{D}_{q}})u_{q}^{{n + 1}} = {{f}_{q}} + \theta {{D}_{q}}u_{q}^{n} - \sum\limits_{r \in {{\Omega }_{q}}} {{{A}_{{q,r}}}} u_{r}^{n},\quad q = 1,\;2,\; \ldots ,\;P.$$
(34)

Here \(\theta \in [0,\;1]\) is an iteration (compensating) parameter, and \({{D}_{q}}\) is the diagonal matrix determined by the equation

$${{D}_{q}}e = \sum\limits_{r \in {{\Omega }_{q}}} {{{A}_{{q,r}}}} e,\quad e = {{(1,\; \ldots ,\;1)}^{{\text{T}}}} \in {{\mathcal{R}}^{{Nq}}},$$
(35)

which is called the compensation or filtering condition (sometimes, the introduction of matrix \({{D}_{q}}\) of form (35) is called lumping). In some cases, the choice of \(\theta \) can be justified and described theoretically, and it helps significantly accelerate the iterative process. There are generalizations of the compensation principle (35) based on a complication of the matrices \({{D}_{q}}\) with banded structure and on the use of several compensating (filtering) vectors [2, 9, 10]:

$${{D}_{q}}{{y}_{s}} = \sum\limits_{r \in {{\Omega }_{q}}} {{{A}_{{q,r}}}} {{y}_{s}},\quad s = 1,\;2,\; \ldots ,\;S;$$
(36)

however, this issue requires further study.
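
A minimal dense sketch of one iteration of the generalized block Jacobi method (34) with the lumping condition (35); the block partition is supplied as a list of index arrays, and all names are illustrative.

```python
import numpy as np

def block_jacobi_step(A, f, u, blocks, theta=1.0):
    """One iteration of (34)-(35). `blocks` is a list of index arrays defining
    the block rows of A; theta in [0, 1] is the compensation parameter."""
    u_new = u.copy()
    for q in blocks:
        mask = np.ones(A.shape[0], dtype=bool)
        mask[q] = False                                 # indices outside the q-th block
        A_qq = A[np.ix_(q, q)]
        A_qr = A[np.ix_(q, mask)]                       # off-diagonal part of the q-th block row
        D_q = np.diag(A_qr @ np.ones(mask.sum()))       # lumping (35): D_q e = sum_r A_{q,r} e
        rhs = f[q] + theta * (D_q @ u[q]) - A_qr @ u[mask]
        u_new[q] = np.linalg.solve(A_qq + theta * D_q, rhs)
    return u_new
```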

An alternative to the block Jacobi method, which belongs to the class of simultaneous displacement (additive) algorithms, is the Seidel method and its elaborations—the successive over-relaxation and symmetric successive over-relaxation (SOR and SSOR) algorithms. The last method, with some modification, is also known under the names of the alternating triangular method, incomplete factorization, and the explicit alternating direction scheme (see [2, 9]). These successive displacement, or multiplicative, methods (the new approximation \(u_q^{{n + 1}}\) can be computed only after the preceding subvectors \(u_1^{{n + 1}},\; \ldots ,\;u_{{q - 1}}^{{n + 1}}\) have been found) have an improved convergence rate, and the accelerating compensation principle (35), (36) can be applied to them. However, such algorithms are based on inverting triangular matrices, which is difficult to parallelize. For this reason, in the age of multiprocessor supercomputers, they turned out to be less popular, and we do not consider them in this paper. On the other hand, it was shown in [43] that the application of adaptive successive displacement downstream methods can significantly reduce the number of iterations.

Numerous preconditioning methods have theoretical justification and estimates of the convergence rate (see the review in [44–46]). A special case is dynamic or flexible preconditioning with variable matrices [17, 47–51]. For SLAEs arising from the approximation of boundary value problems on a grid with characteristic step \(h\), the condition number \(O({{h}^{{ - 2}}})\) is usually reduced to \(O({{h}^{{ - 1}}})\) by preconditioning. As a result, the number of iterations in Krylov-type methods is estimated by \(n(\varepsilon ) = O({{h}^{{ - 1/2}}})\). However, it should be taken into account that these asymptotic expressions can include large constants in “bad” problems with a large spread of the values of the elements of the matrix \(A\).

3.1 Domain Decomposition Algorithms

When grid SLAEs (especially of node type) are solved, each component of the vector to be found is associated with one grid node, and the structure and portrait of the matrix (the configuration of its nonzero elements) are isomorphic to the grid graph (which is directed for asymmetric matrices and undirected for symmetric ones). In this case, the matrix block structure of form (2) is geometrically represented by the decomposition of the grid computational domain \(\Omega \) into grid subdomains \({{\Omega }_{q}}\), in each of which a subproblem algebraically corresponding to the diagonal block \({{A}_{{q,q}}}\) is formed. Historically, the priority in decomposition methods belongs to the theory of multidimensional boundary value problems since the 19th century, when the solution of differential equations in complex computational domains using the Schwarz alternating method was reduced to examining a sequence of auxiliary problems in overlapping subdomains with the Dirichlet conditions on the introduced internal boundaries. Later, such approaches were generalized at the continuous level in the works by Smelov, Lions, Matsokin, Widlund, Nataf and his colleagues, Vasilevski, Olshanskii, and many other researchers (see the reviews of the literature in [5, 7, 17–19, 52–57]), and they were also extended to the discrete, or grid, level. In the age of multiprocessor supercomputers, decomposition methods became the main tool for parallelizing computations in the solution of multidimensional initial boundary value problems with complex real-life data that require high resolution and speed of computations.

The DDMs may be classified according to three main characteristics. The first one is the dimension of the decomposition, depending on whether one, two, or three families of coordinate planes are used for the decomposition. The second characteristic is the size of the subdomain overlap: in the simplest case the subdomains do not overlap, while parameterized overlaps are constructed on the grid level by the layer-by-layer expansion of each subdomain by elementary grid cells. The third direction of DDM development is based on generalizing the types of interface boundary conditions used when iterating over the subdomains.

On the whole, DDMs are implemented as a two-level iterative process in Krylov subspaces—the upper level is the additive Jacobi–Schwarz block method (34), which for simplicity in its stationary version can be written as

$${{u}^{{n + 1}}} = {{u}^{n}} + B_{1}^{{ - 1}}(f - A{{u}^{n}}),\quad {{u}^{n}} = \{ u_{q}^{n}\} ,\quad {{B}_{1}} = {\text{block}} - \operatorname{diag} \{ {{B}_{{q,q}}}\} ,$$
(37)

where \({{B}_{1}}\) is the corresponding preconditioner that will actually be used in the Krylov process. At each \(n\)th iteration (37), auxiliary problems are simultaneously (concurrently) solved in the subdomains \({{\Omega }_{q}}\); formally, this solution is reduced to inverting the preconditioner blocks \({{B}_{{q,q}}}\). This can be done using direct or iterative methods, among which it is natural to choose, in the case of large algebraic systems, preconditioned algorithms in the corresponding Krylov subspaces. Note that in the latter case we have the problem of solving multiple SLAEs with the same matrix but different right-hand sides, which can be done efficiently by reusing bases of Krylov subspaces, as was mentioned in Remark 1.

From the viewpoint of parallelizing computations on multiprocessors, it is preferable to decompose the computational domain into a greater number of subdomains \({{\Omega }_{q}}\) (\(q = 1,\;2,\; \ldots ,\;P\)) so as to solve the corresponding subsystems synchronously on \(P\) processors. However, the number of external iterations obviously grows in this case; this number can be estimated as \(O({{P}^{{1/d}}})\), where \(d\) is the geometric dimension of the domain (assuming that the decomposition forms a \(d\)-dimensional “macrogrid” of subdomains). The number of external iterations can be reduced by increasing the overlap of adjacent subdomains; however, each step then becomes costlier (in practice, the overlap of adjacent grid subdomains is several grid steps \(h\) wide).

To accelerate the Krylov-type iterations in DDMs, there are various approaches, which can be interpreted as the application of additional preconditioners, including smoothing preconditioners; for this reason, such methods are sometimes called hybrid (see [58, 59]). By way of example, consider the coarse-grid correction method based on the deflation principle of choosing an initial approximation \({{u}^{0}}\).

Suppose we have an information basis consisting of vectors \({{{v}}_{1}},\; \ldots ,\;{{{v}}_{m}}\), which compose the matrix \(V = ({{{v}}_{1}}\; \ldots \;{{{v}}_{m}}) \in {{\mathcal{R}}^{{N,m}}}\), \(m < N.\) By choosing an arbitrary vector \({{u}^{{ - 1}}}\), we seek an initial approximation and the corresponding residual vector in the form

$${{u}^{0}} = {{u}^{{ - 1}}} + Vc,\quad {{r}^{0}} = {{r}^{{ - 1}}} - AVc,\quad {{r}^{{ - 1}}} = f - A{{u}^{{ - 1}}},$$

where the vector of unknown coefficients \(c = {{({{c}_{1}},\; \ldots ,\;{{c}_{m}})}^{{\text{T}}}}\) is determined by minimizing the residual \({{r}^{0}}\) from the overdetermined equation

$$AVc = {{r}^{{ - 1}}}.$$
(38)

By multiplying both sides of (38) on the left by \({{V}^{{\text{T}}}}{{A}^{\gamma }}\) with the parameter \(\gamma = 0\) or \(\gamma = 1\), we obtain the family of generalized solutions

$$c = {{({{V}^{{\text{T}}}}{{A}^{{\gamma + 1}}}V)}^{ + }}{{V}^{{\text{T}}}}{{A}^{\gamma }}{{r}^{{ - 1}}},$$

where the case \(\gamma = 1\) corresponds to the normal solution, which minimizes the Euclidean norms of the residual \({{r}^{0}} = {{r}^{{ - 1}}} - AVc\) and the solution itself \(c\). Hence, we obtain for the corrected residual

$${{V}^{{\text{T}}}}{{A}^{\gamma }}{{r}^{0}} = 0,\quad {{r}^{0}} = T{{r}^{{ - 1}}},\quad T = I - AV{{({{V}^{{\text{T}}}}{{A}^{{\gamma + 1}}}V)}^{{ - 1}}}{{V}^{{\text{T}}}}{{A}^{\gamma }},$$
(39)

which corresponds to the minimum of the functional

$${{\Phi }_{\gamma }}({{r}^{0}}) = ({{A}^{{\gamma - 1}}}{{r}^{0}},{{r}^{0}}).$$

It is easy to see that here the vectors \({{{v}}_{s}}\) play the role of direction vectors in Krylov subspaces, i.e., they provide the residual vector \({{r}^{0}}\) with orthogonal and variational properties similar to the properties in the conjugate gradient and conjugate residual methods at \(\gamma = 0\) and \(\gamma = 1\), respectively.

Following this reasoning, we use \({{r}^{0}}\) to find the initial direction vector for the conjugate direction method from the orthogonality conditions

$${{V}^{{\text{T}}}}{{A}^{{\gamma + 1}}}{{p}^{0}} = 0,\quad {{p}^{0}} = B_{2}^{{ - 1}}{{r}^{0}},\quad B_{2}^{{ - 1}} = I - {{A}^{{\gamma + 1}}}V{{({{V}^{{\text{T}}}}{{A}^{{2\gamma + 2}}}V)}^{{ - 1}}}{{V}^{{\text{T}}}}{{A}^{{\gamma + 1}}}.$$
(40)

The matrix \({{B}_{2}}\) here plays the role of the second preconditioner in addition to \({{B}_{1}}\) in (37), and they both can be used in the subsequent iterations by the formulas of multipreconditioned conjugate directions (see [17]). Note that the exponents of the matrix \(A\) in (40) are chosen in such a way that they ensure the symmetry of \({{B}_{2}}\).

The choice of basis vectors \({{{v}}_{1}},\; \ldots ,\;{{{v}}_{m}}\) that effectively extend the Krylov subspace is important in many accelerating iterative procedures. For example, the coarse-grid correction algorithm used in DDMs is based, in the simplest case, on the “geometric” decomposition of the grid domain into subdomains \({{\Omega }_{q}}\): the components of \({{{v}}_{q}}\) are 1 or 0, depending on whether or not the corresponding node lies in the \(q\)th subdomain. A further elaboration of this approach is the transition from piecewise constant basis functions \({{{v}}_{q}}\) to piecewise polynomial functions of higher orders (see [53]).
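
A small sketch of the deflated choice of the initial approximation (38) in the least squares sense: V holds the chosen basis vectors as columns, e.g., the piecewise constant subdomain indicators described above, and all names are illustrative.

```python
import numpy as np

def deflated_initial_guess(A, f, V, u_prev):
    """Coarse-grid (deflation) correction of an arbitrary starting vector u_prev:
    u^0 = u_prev + V c, with c minimizing ||r^{-1} - A V c|| as in (38)."""
    r_prev = f - A @ u_prev
    AV = A @ V
    # Normal equations (A V)^T (A V) c = (A V)^T r^{-1}; pinv guards against rank deficiency.
    c = np.linalg.pinv(AV.T @ AV) @ (AV.T @ r_prev)
    u0 = u_prev + V @ c
    return u0, f - A @ u0                     # corrected initial guess and its residual
```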

3.2 Multigrid Methods

In the pioneering works by Fedorenko and Bakhvalov, multigrid approaches were based on spectral-approximation principles with a separate suppression of the error components corresponding to low and high frequencies. The further development of these approaches took the geometric, algebraic (the most popular name is AMG—Algebraic Multi-Grid), and combinatorial directions (see [5, 56, 60–67]).

These approaches occupy an exceptional place in computational algebra, since they are asymptotically optimal in order, i.e., the required computer resources are proportional to the SLAE size. The most general algebraic methodology is based on universal principles of matrix preconditioning. The implementation of technologies with various numbers of nested grids is interpreted as a recursive application of a two-grid algorithm. For simplicity, consider the sequence of nested grids \({{\Omega }_{m}} \subset \; \ldots \; \subset {{\Omega }_{2}} \subset {{\Omega }_{1}}\); i.e., the nodes of each coarser grid \({{\Omega }_{{k + 1}}}\) are nodes of the “nearest” finer grid \({{\Omega }_{k}}\). We assume that \({{\Omega }_{k}}\) is obtained from \({{\Omega }_{{k + 1}}}\) by “refining by a factor of two,” so that the numbers of grid nodes are related as \({{N}_{k}} \approx {{2}^{d}}{{N}_{{k + 1}}}\), where \(d\) is the domain dimension. To solve the SLAE \(Au = f\) on \({{\Omega }_{1}}\), which may be renamed \({{A}_{1}}{{u}_{1}} = {{f}_{1}}\), a preconditioner \({{B}_{1}}\) is implicitly constructed by approximately solving a system with a specially constructed matrix \({{A}_{2}}\) on the grid \({{\Omega }_{2}}\), and so on; i.e., the process goes in a “loop.”

In the general modern interpretation, the AMG method can be represented by the following stages of iterative solution of the algebraic system \({{A}_{k}}{{u}_{k}} = {{f}_{k}}\) on \({{\Omega }_{k}}\).

1. Given an approximation of the vector \(u_{k}^{n}\), the residual on the “fine” grid \({{\Omega }_{k}}\) is found:

$$r_{k}^{n} = {{f}_{k}} - {{A}_{k}}u_{k}^{n},\quad {{A}_{1}} = A.$$
(41)

2. For the vector \(r_{k}^{n}\), preprocessing (presmoothing) is performed usually by making several iterations using a simple algorithm. More precisely, this procedure is performed in two steps. At the first step, the direction vector

$${{\hat {S}}_{k}}\tilde {p}_{k}^{n} = r_{k}^{n}$$
(42)

is calculated, where \({{\hat {S}}_{k}}\) is the operator (matrix) of this presmoothing step. Actually, \({{\hat {S}}_{k}}\) is an approximation of the matrix \({{A}_{k}}\) (a formal representation of the results of several iteration steps of the smoothing method). At the second step, the next (smoothed) approximate solution and the corresponding residual are found:

$$\tilde {u}_{k}^{n} = u_{k}^{n} + \tilde {p}_{k}^{n},\quad \tilde {r}_{k}^{n} = {{f}_{k}} - {{A}_{k}}\tilde {u}_{k}^{n} = r_{k}^{n} - {{A}_{k}}\tilde {p}_{k}^{n}.$$
(43)

3. Using the vector \(\tilde {r}_{k}^{n}\) determined on the fine grid \({{\Omega }_{k}}\), the residual vector \(\tilde {r}_{{k + 1}}^{n}\) on the coarse grid \({{\Omega }_{{k + 1}}}\) is formed:

$$\tilde {r}_{{k + 1}}^{n} = {{R}_{k}}\tilde {r}_{k}^{n},\quad {{R}_{k}} \in {{\mathcal{R}}^{{{{N}_{{k + 1}}},{{N}_{k}}}}},\quad \tilde {r}_{k}^{n} \in {{\mathcal{R}}^{{{{N}_{k}}}}},\quad \tilde {r}_{{k + 1}}^{n} \in {{\mathcal{R}}^{{{{N}_{{k + 1}}}}}},$$
(44)

where \({{R}_{k}}\) is a restriction operator (the restriction stage). In the simplest case, this operation is the projection of the vector from the grid \({{\Omega }_{k}}\) onto \({{\Omega }_{{k + 1}}}\).

4. On the coarse grid, the direction vector \(\hat {p}_{{k + 1}}^{n}\) is found from the solution of the SLAE

$${{A}_{{k + 1}}}\hat {p}_{{k + 1}}^{n} = \tilde {r}_{{k + 1}}^{n},\quad {{A}_{{k + 1}}} \in {{\mathcal{R}}^{{{{N}_{{k + 1}}},{{N}_{{k + 1}}}}}},\quad \hat {p}_{{k + 1}}^{n},\hat {r}_{{k + 1}}^{n} \in {{\mathcal{R}}^{{{{N}_{{k + 1}}}}}},$$
(45)

where \({{A}_{{k + 1}}}\) is the matrix of the SLAE for the grid \({{\Omega }_{{k + 1}}}\), which in a certain sense inherits the approximation and algebraic properties of the operator \({{A}_{k}}\) on \({{\Omega }_{k}}\). There are various methods for constructing the matrix \({{A}_{{k + 1}}}\), which is sometimes called the coarse-grid correction matrix.

5. The vector \(\hat {p}_{{k + 1}}^{n}\) found from the solution of system (45) is continued from the coarse grid \({{\Omega }_{{k + 1}}}\) to the fine grid \({{\Omega }_{k}}\) (the prolongation stage):

$${{\check{p}_{k}^{n}}}\,={{P}_{k}}\hat{p}_{k+1}^{n},\quad {{P}_{k}}\in {{\mathcal{R}}^{{{N}_{k}},{{N}_{k+1}}}},\quad {{\check{p}_{k}^{n}}}\,\in {{\mathcal{R}}^{{{N}_{k}}}}.$$
(46)

In a certain sense, a consistent definition of the operators above is one satisfying the relation \(A_{k}^{{ - 1}} \approx {{P}_{k}}A_{{k + 1}}^{{ - 1}}{{R}_{k}}\). If, for example, the matrix \({{A}_{k}}\) is symmetric, then in order to inherit this property on the coarse grid \({{\Omega }_{{k + 1}}}\), it is reasonable to use \({{P}_{k}} = R_{k}^{{\text{T}}}\). In particular, this gives the so-called Galerkin approximation \({{A}_{{k + 1}}} = P_{k}^{{\text{T}}}{{A}_{k}}{{P}_{k}}\).

6. For the vector \({{\check{p}_{k}^{n}}}\,\), the corresponding vectors of the approximate solution and residual on the fine grid (residual update) are found by

$${{\check{u}_{k}^{n}}}\,=\tilde{u}_{k}^{n}+{{\check{p}_{k}^{n}}}\,,\quad {{\check{r}_{k}^{n}}}\,={{f}_{k}}-{{A}_{k}}{{\check{u}_{k}^{n}}}\,=\tilde{r}_{k}^{n}-{{A}_{k}}{{\check{p}_{k}^{n}}}\,,\quad {{\check{r}_{k}^{n}}}\,\in {{\mathcal{R}}^{{{N}_{k}}}}.$$
(47)

7. Postprocessing, i.e., repeated smoothing, is performed on the fine grid \({{\Omega }_{k}}\): the new direction vector \(\hat {p}_{k}^{n}\) is calculated from the solution of the auxiliary SLAE with the matrix \({{{\check{S}_{k}}}}\,\) (the repeated smoothing operator, the postsmoothing stage)

$${{{\check{S}_{k}}}}\,{\kern 1pt} \hat{p}_{k}^{n}={{\check{r}_{k}^{n}}}\,$$
(48)

similarly to (42).

8. The resulting direction vector of this first AMG iteration is determined as

$$p_{k}^{n}=\mathop{{\hat{p}}}_{k}^{n}+{{\check{p}_{k}^{n}}}\,\ +\tilde{p}_{k}^{n}=B_{k}^{-1}r_{k}^{n},$$

where \({{B}_{k}}\) is the preconditioning matrix of this two-grid stage of the method, which can be explicitly represented in terms of the smoothing, restriction, coarse-grid correction, and prolongation operators; however, its specific representation is of no importance because the computations implementing it are described by formulas (41)–(48). There are a lot of specific versions of the two-grid method, and all of them (in the framework of the stationary iterative process) have the form

$$u_{k}^{{n + 1}} = u_{k}^{n} + p_{k}^{n} = u_{k}^{n} + B_{k}^{{ - 1}}r_{k}^{n},$$
(49)

which is a two-level iterative process since the operation of multiplying by \(B_{k}^{{ - 1}}\) includes the solution of a SLAE with the matrix \({{A}_{{k + 1}}}\), which is unreasonable to do by the direct algorithm in the case of large \({{N}_{{k + 1}}}\). It is clear that a preconditioned method in Krylov subspaces can be constructed on the basis of algorithm (49).
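
For concreteness, the following is a minimal sketch of one application of the two-grid preconditioner of steps 1–8, with damped Jacobi smoothing and a direct solve on the coarse grid; the restriction and prolongation operators R and P, the coarse matrix, and the smoothing parameters are assumptions supplied by the caller, and nothing here refers to a particular AMG package.

```python
import numpy as np

def two_grid_apply(A_fine, A_coarse, R, P, r, nu=2, omega=0.8):
    """Compute p ~ B_k^{-1} r by one two-grid cycle (steps (41)-(48)): damped Jacobi
    presmoothing, restriction R, coarse-grid correction, prolongation P, postsmoothing."""
    d_inv = omega / np.diag(A_fine)
    p = np.zeros_like(r)
    for _ in range(nu):                               # presmoothing, cf. (42)-(43)
        p += d_inv * (r - A_fine @ p)
    r_coarse = R @ (r - A_fine @ p)                   # restriction of the residual (44)
    p_coarse = np.linalg.solve(A_coarse, r_coarse)    # coarse-grid correction (45)
    p += P @ p_coarse                                 # prolongation and update (46), (47)
    for _ in range(nu):                               # postsmoothing, cf. (48)
        p += d_inv * (r - A_fine @ p)
    return p                                          # then u^{n+1} = u^n + p, as in (49)
```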

However, the idea of the multigrid approach is to use the recursive principle: to approximately solve the equation on the coarse grid \({{\Omega }_{{k + 1}}}\), the methods described above are applied on the coarser grid \({{\Omega }_{{k + 2}}}\), and so on. The computations on each grid \({{\Omega }_{k}}\) by formulas (41)–(48) may be repeated a prescribed number of times \(\gamma \), which can be written as a generalization of relation (49):

$$u_{k}^{{n + 1}} = u_{k}^{n} + B_{k}^{{ - \gamma }}r_{k}^{n}.$$
(50)

In practice, the number of such repetitions is \(\gamma = 1\) or \(\gamma = 2\). In the first case, the computation scheme is called a \(V\)-cycle, and in the second case, a \(W\)-cycle. In [64, 65], an elaboration of this approach was proposed, the \(K\)-cycle, in which at \(\gamma = 2\) a Krylov-type (rather than stationary) iterative process is used (more specifically, the “flexible” conjugate gradient (FCG) method described in [40]). In the last formula, it should be taken into account that the preconditioner \({{B}_{k}}\) is expressed recursively for \(k = 1,\;2,\; \ldots ,\;m - 1\), and on the last, \(m\)th, grid the SLAE (45) with the matrix \({{A}_{{m + 1}}}\) is solved to high accuracy (by a direct or an iterative method).
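For illustration, a minimal sketch of the recursion behind formulas (41)–(50) is given below (in Python with NumPy, using dense arrays purely for readability); the containers `A`, `R`, `P`, the function `jacobi_smooth`, and the parameter `gamma` are illustrative assumptions standing in for the grid operators, the smoothers \(S_k\), \(\check{S}_k\), and the cycle type.

```python
import numpy as np

def jacobi_smooth(A, b, x, sweeps=2, omega=0.7):
    """Damped Jacobi sweeps playing the role of the smoothers."""
    d = np.diag(A)
    for _ in range(sweeps):
        x = x + omega * (b - A @ x) / d
    return x

def mg_apply(k, r, A, R, P, gamma=1, coarsest=3):
    """Return p ~ B_k^{-1} r by one gamma-cycle of the two-grid step (41)-(48):
    pre-smoothing, restriction, recursive coarse-grid correction, prolongation,
    residual update, post-smoothing; gamma=1 gives a V-cycle, gamma=2 a W-cycle."""
    if k == coarsest:
        return np.linalg.solve(A[k], r)              # the last grid is solved "exactly"
    p = jacobi_smooth(A[k], r, np.zeros_like(r))      # pre-smoothing (42)
    for _ in range(gamma):
        r_def = r - A[k] @ p                          # current residual (43), (47)
        p_coarse = mg_apply(k + 1, R[k] @ r_def, A, R, P, gamma, coarsest)  # (44), (45)
        p = p + P[k] @ p_coarse                       # prolongation and update (46), (47)
    return jacobi_smooth(A[k], r, p)                  # post-smoothing (48)
```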

Specific implementations of multigrid approaches, which have already become classical, differ in the way of choosing the matrix operators that determine the successive stages of the computation scheme described above.

3.3 Incomplete Factorization Algorithms

The most classical methods of SLAE preconditioning are the numerous explicit and implicit matrix factorization methods; a thorough analysis of such methods, without parallelization aspects, can be found in [10]. Modern approaches to these technologies, including promising approximations of the matrices that are inverse to the triangular factors in incomplete matrix decompositions, are described in [68–70] and references therein.

The typical form of an easily invertible preconditioning matrix is obtained by the approximate factorization

$$B = (G + L){{G}^{{ - 1}}}(G + U) = G + L + U + L{{G}^{{ - 1}}}U.$$
(51)

Here \(L\) and \(U\) are the lower and upper triangular parts of the original matrix \(A = D + L + U,\) and \(D\) is a block diagonal or diagonal (in a special case) matrix. On the basis of the general requirement \(B \approx A\), the block diagonal matrix \(G\) is constructed using the relation

$$G = D - \overline {L{{G}^{{ - 1}}}U} ,$$
(52)

where the overscore above a matrix denotes its banded approximation.

Based on formulas (51) and (52), various symmetric successive over-relaxation (SSOR-type) methods are constructed (in this case, \(G = 2{{\omega }^{{ - 1}}}D\), where \(\omega \) is an iteration parameter); another group is formed by explicit and implicit incomplete factorization methods, which are analyzed in the books [2, 5, 10, 68] and in the special reviews [51, 69, 70]. A large family of algorithms is based on the incomplete LU-factorization of the original matrix, which in the symmetric case reduces to the approximate Cholesky decomposition. The quality of the matrix approximation and the convergence rate are improved by allowing more fill-in in the matrix factors; however, each iteration step becomes costlier.
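As an illustration of how a preconditioner of the form (51) is applied, the following sketch (Python with SciPy) solves \(Bp = r\) by one forward and one backward triangular sweep; it assumes, for simplicity only, that \(G\) is diagonal and is supplied as a vector, and the function name is hypothetical.

```python
import numpy as np
from scipy.sparse import tril, triu, diags
from scipy.sparse.linalg import spsolve_triangular

def apply_factorized_preconditioner(A, g, r):
    """Solve B p = r with B = (G + L) G^{-1} (G + U) from (51):
    L and U are the strictly lower/upper triangular parts of A,
    and the (here diagonal) matrix G is given by the vector g."""
    L = tril(A, k=-1, format="csr")
    U = triu(A, k=1, format="csr")
    G = diags(g, format="csr")
    y = spsolve_triangular(G + L, r, lower=True)       # (G + L) y = r
    p = spsolve_triangular(G + U, G @ y, lower=False)  # (G + U) p = G y
    return p
```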

A similar approach, but focused on parallelization, is to construct a sparse approximation of the inverse matrix, thus eliminating the need to solve triangular systems at each iteration. These methods were studied in both pointwise and block versions. Novel results in this direction were obtained by Eremin, Kaporin, Kolotilina, Kon’shin, and other researchers in [31, 32, 71, 72]. Another promising direction of research is based on the factorization (including approximate one) of hierarchical matrices in HSS formats using low-rank block structures in the LU decomposition (see [73, 74]).

Finally, we point out one more promising family of algorithms for constructing preconditioners that are based on network programming or graph transformation methods (see [75–77]). An important point is that, while the incomplete triangular decomposition of matrices is traditionally realized for nonsingular matrices, its modifications for positive semidefinite cases are now available.

A general approach to speeding up the iterations is based on the compensation principle or matching row sums, which is to find a matrix \(G\) in (51), (52) such that

$$B{{y}^{{(l)}}} = A{{y}^{{(l)}}},\quad l = 1,\;2,\; \ldots ,\;m,$$
(53)

on a given set of \(m\) trial vectors (see [49, 78] and references therein). In some works, this method is also called filtering.

To satisfy condition (53), the matrix \(G\) is transformed to the form

$$G = D - \overline {L{{G}^{{ - 1}}}U} - \theta S,$$
(54)

where \(\theta \in \left[ {0,1} \right]\) is a compensating parameter and \(S\) is a block diagonal matrix formed on the basis of the condition

$$S{{y}^{{(l)}}} = (L{{G}^{{ - 1}}}U - \overline {L{{G}^{{ - 1}}}U} ){{y}^{{(l)}}},\quad l = 1,\;2,\; \ldots ,\;m.$$
(55)
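A minimal sketch of the compensation (filtering) condition (53)–(55) for a single trial vector (\(m = 1\)) is given below; it assumes, for illustration only, that \(G\) is diagonal and that the banded approximation denoted by the overbar keeps just the diagonal of \(L{{G}^{{ - 1}}}U\).

```python
import numpy as np

def diagonal_compensation(L, g, U, y):
    """Diagonal matrix S satisfying (55) for one trial vector y:
    S y = (L G^{-1} U - overline(L G^{-1} U)) y, with the overbar taken
    here as the diagonal part; y must have nonzero entries (e.g., all ones)."""
    M = L @ np.diag(1.0 / g) @ U            # L G^{-1} U (dense, for clarity)
    dropped = M - np.diag(np.diag(M))       # the part discarded by the overbar
    return np.diag((dropped @ y) / y)       # componentwise: S_ii y_i = (dropped y)_i
```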

Among the incomplete factorization methods for solving network SLAEs, implicit methods are the fastest; they are based on tridiagonal matrix algorithms for solving auxiliary tridiagonal systems. Such algorithms can be effectively parallelized on multiprocessors using a multithreaded implementation of recursive tridiagonal matrix procedures based on the reduction of the original SLAEs or on bordering methods. Note that in [57] a two-thread block version of the alternate-triangular matrix factorization method was proposed, in which each factor is not a lower or upper triangular matrix but rather consists of block rows of different orientations: some of them are lower triangular and the others are upper triangular; this decomposition is called twisted decomposition (see the review in [27]).

If \(A\) is a symmetric matrix, i.e., \(D = {{D}^{{\text{T}}}}\) and \(L = {{U}^{{\text{T}}}}\), then it is natural to choose symmetric matrices \(G\) and \(B\) and use two-sided preconditioning of the original SLAE while preserving the symmetry of the resulting system:

$$\begin{gathered} \bar {A}\bar {u} = \bar {f},\quad \bar {A} = L_{B}^{{ - 1}}AU_{B}^{{ - 1}},\quad \bar {u} = {{U}_{B}}u,\quad \bar {f} = L_{B}^{{ - 1}}f, \\ B = {{L}_{B}}{{U}_{B}},\quad {{L}_{B}} = (G + L)U_{G}^{{ - 1}},\quad {{U}_{B}} = L_{G}^{{ - 1}}(G + U),\quad G = {{L}_{G}}{{U}_{G}},\quad {{U}_{G}} = L_{G}^{{\text{T}}}. \\ \end{gathered} $$
(56)

The preconditioned matrix \(\bar {A}\) can be reduced by simple transformations to the form

$$\begin{gathered} \bar {A} = {{(I + \bar {L})}^{{ - 1}}} + {{(I + \bar {U})}^{{ - 1}}} + {{(I + \bar {L})}^{{ - 1}}}(\bar {D} - 2I){{(I + \bar {U})}^{{ - 1}}}, \\ \bar {D} = L_{G}^{{ - 1}}DU_{G}^{{ - 1}},\quad \bar {L} = L_{G}^{{ - 1}}LU_{G}^{{ - 1}},\quad \bar {U} = L_{G}^{{ - 1}}UU_{G}^{{ - 1}}, \\ \end{gathered} $$
(57)

in which the vector-matrix multiplication can be implemented in the form

$$\bar {A}{v} = {{(I + \bar {L})}^{{ - 1}}}[{v} + (\bar {D} - 2I)w] + w,\quad w = {{(I + \bar {U})}^{{ - 1}}}{v};$$
(58)

in many cases, this saves a significant amount of computations. Note that the matrix representations (56)–(58) also hold without the requirements \({{L}_{B}} = U_{B}^{{\text{T}}}\) and \({{L}_{G}} = U_{G}^{{\text{T}}}\); i.e., these formulas also apply to asymmetric SLAEs.
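A sketch of the economized multiplication (58) is shown below (Python with SciPy); the names `L_bar`, `U_bar`, `D_bar` are illustrative placeholders for the transformed strictly triangular parts and the transformed block \(\bar D\), and the unit triangular systems are solved directly.

```python
from scipy.sparse import identity
from scipy.sparse.linalg import spsolve_triangular

def apply_Abar(L_bar, U_bar, D_bar, v):
    """Compute A-bar v by formula (58): w = (I + U_bar)^{-1} v and
    A-bar v = (I + L_bar)^{-1} [v + (D_bar - 2I) w] + w."""
    n = v.shape[0]
    I = identity(n, format="csr")
    w = spsolve_triangular(I + U_bar, v, lower=False)
    t = v + D_bar @ w - 2.0 * w
    return spsolve_triangular(I + L_bar, t, lower=True) + w
```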

The direct inversion of triangular matrices used in the algorithms described above is poorly parallelized. There are special tricks for mitigating this drawback (e.g., see [79] and the review in [80]). We consider another approach that uses alternate-triangular matrices \(\hat {L}\) and \(\hat {U}\) instead of \(L\) and \(U\); these matrices have nonzero elements only to the left of the principal diagonal in some rows and only to the right of it in the other rows. We illustrate the construction of such preconditioners using the example of a block tridiagonal matrix

$$A = \left[ {\begin{array}{*{20}{c}} {{{D}_{1}}}&{{{U}_{1}}}&{}&{}&0 \\ {{{L}_{2}}}&{{{D}_{2}}}&{{{U}_{2}}}&{}&{} \\ {}& \ddots & \ddots & \ddots &{} \\ {}&{}&{}&{}&{{{U}_{{{{N}_{x}} - 1}}}} \\ 0&{}&{}&{{{L}_{{{{N}_{x}}}}}}&{{{D}_{{{{N}_{x}}}}}} \end{array}} \right].$$
(59)

In this case, we define the alternate-triangular matrices by

$$\hat {L} = \left[ {\begin{array}{*{20}{c}} 0&0&{}&{}&{}&{}&{} \\ {{{L}_{2}}}&0&0&{}&{}&0&{} \\ {}&{{{L}_{3}}}&0&0&{}&{}&{} \\ {}&{}&{{{L}_{4}}}&0&{{{U}_{4}}}&{}&{} \\ {}&{}&{}&0&0&{{{U}_{5}}}&{} \\ {}&0&{}&{}&0&0&{{{U}_{6}}} \\ {}&{}&{}&{}&{}&0&0 \end{array}} \right],\quad \hat {U} = \left[ {\begin{array}{*{20}{c}} 0&{{{U}_{1}}}&{}&{}&{}&{}&{} \\ 0&0&{{{U}_{2}}}&{}&{}&0&{} \\ {}&0&0&{{{U}_{3}}}&{}&{}&{} \\ {}&{}&0&0&0&{}&{} \\ {}&{}&{}&{{{L}_{5}}}&0&0&{} \\ {}&0&{}&{}&{{{L}_{6}}}&0&0 \\ {}&{}&{}&{}&{}&{{{L}_{7}}}&0 \end{array}} \right].$$
(60)

In this case, \(A = D + \hat {L} + \hat {U}\); therefore, the incomplete factorization formulas (51)–(58) for determining the preconditioner \(B\) remain valid; only \(L\) and \(U\) should be replaced by \(\hat {L}\) and \(\hat {U}\), respectively. The tridiagonal blocks of the matrix \(\bar {G} = \{ {{\bar {G}}_{i}}\} \) are found by formulas similar to (28) and (29) but with oppositely directed recursions (the block tridiagonal algorithm [25]), which we write for an arbitrary odd \({{N}_{x}} = 2m + 1\):

$$\begin{array}{*{20}{c}} {{{G}_{1}} = {{D}_{1}},\quad {{G}_{i}} = {{D}_{i}} - {{L}_{i}}G_{{i - 1}}^{{ - 1}}{{U}_{{i - 1}}} - \theta {{S}_{i}},\quad i = 2,\; \ldots ,\;m,} \\ {{{G}_{{{{N}_{x}}}}} = {{D}_{{{{N}_{x}}}}},\quad {{G}_{i}} = {{D}_{i}} - {{U}_{i}}G_{{i + 1}}^{{ - 1}}{{L}_{{i + 1}}} - \theta {{S}_{i}},\quad i = {{N}_{x}} - 1,\; \ldots ,\;m + 2,} \\ {{{S}_{i}}{{y}^{{(l)}}} = {{L}_{i}}(G_{{i - 1}}^{{ - 1}} - \bar {G}_{{i - 1}}^{{ - 1}}){{U}_{{i - 1}}}{{y}^{{(l)}}},\quad l = 1,\; \ldots ,\;{{l}_{i}},} \\ {{{G}_{{m + 1}}} = {{D}_{{m + 1}}} - {{L}_{{m + 1}}}G_{m}^{{ - 1}}{{U}_{m}} - {{U}_{{m + 1}}}G_{{m + 2}}^{{ - 1}}{{L}_{{m + 2}}} - \theta {{S}_{{m + 1}}},} \\ {{{S}_{{m + 1}}}{{y}^{{(l)}}} = [{{L}_{{m + 1}}}(G_{m}^{{ - 1}} - \bar {G}_{m}^{{ - 1}}){{U}_{m}} + {{U}_{{m + 1}}}(G_{{m + 2}}^{{ - 1}} - \bar {G}_{{m + 2}}^{{ - 1}}){{L}_{{m + 2}}}]{{y}^{{(l)}}},} \end{array}$$
(61)

where \({{S}_{i}}\) are diagonal matrices for \(l = 1\) and tridiagonal for \(l = 2\).

When the matrix \(B\) is used as a preconditioner in an iterative process, the following auxiliary system must be solved at each step:

$$Bp \equiv (G + \hat {L}){{G}^{{ - 1}}}(G + \hat {U})p = r.$$

Its solution can be found from the successive relations

$$(G + \hat {L}){v} = r,\quad Gw = \hat {U}p,\quad p = {v} - w,$$

which are implemented by twisted decomposition by the formulas

$${{{v}}_{1}} = G_{1}^{{ - 1}}{{r}_{1}},\quad i = 2,\; \ldots ,\;m:{{{v}}_{i}} = G_{i}^{{ - 1}}({{r}_{i}} - {{\hat {L}}_{i}}{{{v}}_{{i - 1}}}),$$
$${{{v}}_{{{{N}_{x}}}}} = G_{{{{N}_{x}}}}^{{ - 1}}{{r}_{{{{N}_{x}}}}},\quad i = {{N}_{x}} - 1,\; \ldots ,\;m + 2:{{{v}}_{i}} = G_{i}^{{ - 1}}({{r}_{i}} - \hat {U}_{i}^{{}}{{{v}}_{{i + 1}}}),$$
$${{{v}}_{{m + 1}}} = G_{{m + 1}}^{{ - 1}}({{r}_{{m + 1}}} - {{\hat {L}}_{{m + 1}}}{{{v}}_{m}} - {{\hat {U}}_{{m + 1}}}{{{v}}_{{m + 2}}}) = {{p}_{{m + 1}}},$$
(62)
$$i = m,\; \ldots ,\;1:{{w}_{i}} = G_{i}^{{ - 1}}{{\hat {U}}_{i}}{{p}_{{i + 1}}},\quad {{p}_{i}} = {{{v}}_{i}} - {{w}_{i}},$$
$$i = m + 2,\; \ldots ,\;{{N}_{x}}:{{w}_{i}} = G_{i}^{{ - 1}}{{\hat {L}}_{i}}{{p}_{{i - 1}}},\quad {{p}_{i}} = {{{v}}_{i}} - {{w}_{i}}.$$

In recurrences (61) and (62), the computations with increasing index \(i\) and with decreasing index \(i\) may be called forward and backward recursions, respectively. It is clear that they can be executed simultaneously in two threads. This limited parallelism is based on the alternate triangulation (60) of the matrices \(L\) and \(U\). This approach can be generalized to the case of \(P\)-fold alternate triangulation; i.e., we can define matrices \(\hat {L}\) and \(\hat {U}\) satisfying the condition \(\hat {L} + \hat {U} = L + U\) as consisting of \(P\) block rows each of which successively changes the triangulation type. The formulas of the block tridiagonal algorithm (61), (62) become more complicated, but they can be parallelized over \(P\) processors.
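The twisted recursions (62) can be sketched as follows (Python with NumPy; dense blocks are stored in lists indexed from 1, with a dummy entry at index 0, purely for readability). The two branches of each loop are independent and could be executed by two threads, which is exactly the limited parallelism discussed above.

```python
import numpy as np

def twisted_block_solve(G, Lh, Uh, r, m):
    """Solve (G + L-hat) v = r and recover p = v - w with G w = U-hat p
    following the twisted recursions (62); G[i], Lh[i], Uh[i], r[i] are
    dense blocks for i = 1, ..., Nx (index 0 is unused), Nx = 2*m + 1."""
    Nx = len(r) - 1
    v = [None] * (Nx + 1)
    p = [None] * (Nx + 1)
    # forward branch: i = 1, ..., m
    v[1] = np.linalg.solve(G[1], r[1])
    for i in range(2, m + 1):
        v[i] = np.linalg.solve(G[i], r[i] - Lh[i] @ v[i - 1])
    # backward branch: i = Nx, ..., m + 2
    v[Nx] = np.linalg.solve(G[Nx], r[Nx])
    for i in range(Nx - 1, m + 1, -1):
        v[i] = np.linalg.solve(G[i], r[i] - Uh[i] @ v[i + 1])
    # meeting point i = m + 1
    v[m + 1] = np.linalg.solve(
        G[m + 1], r[m + 1] - Lh[m + 1] @ v[m] - Uh[m + 1] @ v[m + 2])
    p[m + 1] = v[m + 1]
    # back substitution in both directions
    for i in range(m, 0, -1):
        p[i] = v[i] - np.linalg.solve(G[i], Uh[i] @ p[i + 1])
    for i in range(m + 2, Nx + 1):
        p[i] = v[i] - np.linalg.solve(G[i], Lh[i] @ p[i - 1])
    return p
```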

For example, matrix (59) corresponds to the five-point approximation of the Laplace equation on a rectangular grid with \(N = {{N}_{x}} \cdot {{N}_{y}}\) nodes. In this case, \({{D}_{i}}\) are tridiagonal blocks, and \({{L}_{i}}\) and \({{U}_{i}}\) are diagonal blocks. In the case of three-dimensional problems on a box grid with \(N = {{N}_{x}} \cdot {{N}_{y}} \cdot {{N}_{z}}\) nodes, the matrix \(A\) can be represented in block tridiagonal form (59) in which each diagonal block \({{D}_{k}}\) has the size \({{N}_{x}} \cdot {{N}_{y}}\) and the same structure as the matrix in (59).

The elaboration of the above approaches for solving SLAEs on three-dimensional grids received the name of nested factorization methods; they were proposed in 1983 by Appleyard and Cheshire and later studied by many researchers (see [57, 82, 83] and references therein). Let the matrix \(A\) have the form

$$A = D + {{L}_{1}} + {{U}_{1}} + {{L}_{2}} + {{U}_{2}} + {{L}_{3}} + {{U}_{3}},$$
(63)

where \(D\) is a diagonal matrix and \({{L}_{k}}\) and \({{U}_{k}}\), \(k = 1,2,3\), are lower and upper triangular matrices. Let us find the preconditioning matrix \(B\) using the recursive factorization (51):

$$B = (P + {{L}_{3}}){{P}^{{ - 1}}}(P + {{U}_{3}}) = P + {{L}_{3}} + {{U}_{3}} + {{L}_{3}}{{P}^{{ - 1}}}{{U}_{3}},$$
$$P = (T + {{L}_{2}}){{T}^{{ - 1}}}(T + {{U}_{2}}) = T + {{L}_{2}} + {{U}_{2}} + {{L}_{2}}{{T}^{{ - 1}}}{{U}_{2}},$$
(64)
$$T = (M + {{L}_{1}}){{M}^{{ - 1}}}(M + {{U}_{1}}) = M + {{L}_{1}} + {{U}_{1}} + {{L}_{1}}{{M}^{{ - 1}}}{{U}_{1}};$$

as a result, we obtain

$$B = M + A - D + {{L}_{1}}{{M}^{{ - 1}}}{{U}_{1}} + {{L}_{2}}{{T}^{{ - 1}}}{{U}_{2}} + {{L}_{3}}{{P}^{{ - 1}}}{{U}_{3}}.$$
(65)

Depending on the definition of \(M\), different preconditioners \(B\) can be formed. We consider a simple case when matrix (63) corresponds to the seven-point approximation of the Dirichlet problem for the Poisson equation in a box on a box grid. Then, in the case of natural ordering of the nodes, the matrices \(M\), \(T\), and \(P\) can be made diagonal, tridiagonal, and five-diagonal, respectively, and the preconditioner can be found by the formulas

$$\begin{gathered} M = D - {{L}_{1}}{{M}^{{ - 1}}}{{U}_{1}} - {{\theta }_{2}}{{S}_{2}} - {{\theta }_{3}}{{S}_{3}}, \\ B = A + {{L}_{2}}{{T}^{{ - 1}}}{{U}_{2}} + {{L}_{3}}{{P}^{{ - 1}}}{{U}_{3}} - {{\theta }_{2}}{{S}_{2}} - {{\theta }_{3}}{{S}_{3}}. \\ \end{gathered} $$
(66)

Here \({{\theta }_{2}}\) and \({{\theta }_{3}}\) are iteration (relaxation) parameters, and \({{S}_{2}}\) and \({{S}_{3}}\) are diagonal matrices determined from the conditions of matching row sums (here \(e\) denotes the vector with all components equal to unity) similarly to how this was done above:

$${{S}_{2}}e = {{L}_{2}}{{T}^{{ - 1}}}{{U}_{2}}e,\quad {{S}_{3}}e = {{L}_{3}}{{P}^{{ - 1}}}{{U}_{3}}e.$$
(67)

Note that instead of (67) we may use matching column sums rather than row sums (see [83] and references therein):

$${{e}^{{\text{T}}}}{{S}_{2}} = {{e}^{{\text{T}}}}{{L}_{2}}{{T}^{{ - 1}}}{{U}_{2}},\quad {{e}^{{\text{T}}}}{{S}_{3}} = {{e}^{{\text{T}}}}{{L}_{3}}{{P}^{{ - 1}}}{{U}_{3}}.$$
(68)
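A small sketch of the row-sum conditions (67) is given below (Python with SciPy); the factors `T` and `P` are assumed to be already formed sparse matrices, and \(e\) is taken to be the vector of all ones, which is the usual meaning of row-sum matching.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

def rowsum_matrices(L2, T, U2, L3, P, U3):
    """Diagonal matrices S2, S3 from (67): S2 e = L2 T^{-1} U2 e and
    S3 e = L3 P^{-1} U3 e, with e the vector of all ones."""
    e = np.ones(L2.shape[1])
    s2 = L2 @ spsolve(T.tocsc(), U2 @ e)
    s3 = L3 @ spsolve(P.tocsc(), U3 @ e)
    return diags(s2), diags(s3)
```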

The three-level nested factorization considered above can be structurally simplified by reducing it to the two-level one. To this end, we rewrite the original matrix \(A\) (63) in the form

$$A = {{D}_{3}} + {{L}_{3}} + {{U}_{3}},\quad {{D}_{3}} = {{D}_{2}} + {{L}_{2}} + {{U}_{2}},\quad {{D}_{2}} = D + {{L}_{1}} + {{U}_{1}}.$$
(69)

In this case, \({{D}_{3}}\) is a block diagonal matrix with five-diagonal blocks \({{D}_{{3,i}}}\) of size \({{N}_{y}}{{N}_{z}}\) each of which corresponds to a plane problem in the section \(x = {\text{const}}\) and has the same structure as the matrix \(A\) in (59). Then, the matrix \(P\) in (64) is defined by the formula

$$P = {\text{\{}}{{G}_{i}} = {{D}_{{3,i}}} - \theta {{S}_{i}}{\text{\}}},\quad {{S}_{i}}e = {{L}_{{3,i}}}G_{{i - 1}}^{{ - 1}}{{U}_{{3,i - 1}}}e,$$

which corresponds to setting \({{L}_{3}}{{P}^{{ - 1}}}{{U}_{3}} = 0\) in (64). Note that \({{L}_{3}}\) and \({{U}_{3}}\) can be defined as alternate-triangular matrices, and then the implementation of the algorithm can be parallelized using twisted decomposition.

3.4 Solving Saddle SLAEs

Consider a SLAE with a saddle matrix

$$A\left[ {\begin{array}{*{20}{c}} u \\ p \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} f \\ g \end{array}} \right],\quad A = \left[ {\begin{array}{*{20}{c}} D&{{{C}^{{\text{T}}}}} \\ C&0 \end{array}} \right],$$
(70)

where \(A \in {{\mathcal{R}}^{{N,N}}}\), \(D \in {{\mathcal{R}}^{{{{N}_{1}},{{N}_{1}}}}}\), \(C \in {{\mathcal{R}}^{{{{N}_{2}},{{N}_{1}}}}}\), \(N = {{N}_{1}} + {{N}_{2}}\), \(u,f \in {{\mathcal{R}}^{{{{N}_{1}}}}}\), and \(p,g \in {{\mathcal{R}}^{{{{N}_{2}}}}}\). The matrix \(D\) is assumed to be symmetric and positive semidefinite. A necessary condition for the nonsingularity of \(A\) is \(\ker (D) \cap \ker (C) = {\text{\{}}0{\text{\}}}\), and a sufficient condition is the positive definiteness of \(D\) on \(\ker (C)\) together with \(\ker ({{C}^{{\text{T}}}}) = {\text{\{}}0{\text{\}}}\), which we assume to be fulfilled.

Problems with a saddle point often arise in mathematical formulations in modeling, including initial boundary mixed formulations for differential classical or generalized equations, optimization problems, and computational algebra (see [8487] and references therein). We focus on methods for solving saddle SLAEs with large sparse matrices that arise in grid approximations of multidimensional boundary value problems in many applications—electromagnetism, fluid dynamics, elasticity, plasticity, multiphase filtering in porous media, in optimization problems, etc.

Due to special features of the block structure of saddle algebraic systems, methods for their solution are studied in numerous works, e.g., see the reviews by Golub and his colleagues [84], Brezzi [85], Vassilevski [72], and Notay [86], book by Bychenkov and Chizhonkov [87], series of papers by C. Greif and his colleagues (including the case of asymmetric saddle SLAEs [68, 85]), and Arioli with coauthors [25, 75] (also see the reviews in [9092]).

Note that without loss of generality the saddle SLAE can be considered in the form

$$\left[ {\begin{array}{*{20}{c}} D&{{{C}^{{\text{T}}}}} \\ C&0 \end{array}} \right]\left[ {\begin{array}{*{20}{c}} u \\ p \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} f \\ 0 \end{array}} \right].$$
(71)

Indeed, if we take a particular solution \(\hat {u}\) of the system \(C\hat {u} = g\), then the vector \(\check{u} = u - \hat{u}\), where \(u\) is the solution of SLAE (70), satisfies the system

$$\left[ {\begin{array}{*{20}{c}} D&{{{C}^{{\text{T}}}}} \\ C&0 \end{array}} \right]\left[ {\begin{array}{*{20}{c}} \check{u} \\ p \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} f-D\hat{u} \\ 0 \end{array}} \right].$$

Note that any solution of SLAE (71) simultaneously satisfies the system

$$\tilde {A}{v} = \tilde {A}\left[ {\begin{array}{*{20}{c}} u \\ p \end{array}} \right] \equiv \left[ {\begin{array}{*{20}{c}} {\tilde {D}}&{{{C}^{{\text{T}}}}} \\ C&0 \end{array}} \right]\left[ {\begin{array}{*{20}{c}} u \\ p \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} f \\ 0 \end{array}} \right] \equiv \tilde {f},\quad \tilde {D} = D + R,\quad R = {{C}^{{\text{T}}}}K_{0}^{{ - 1}}C,$$
(72)

where \({v},\tilde {f} \in {{\mathcal{R}}^{N}}\), and \({{K}_{0}} \in {{\mathcal{R}}^{{{{N}_{2}},{{N}_{2}}}}}\) is an arbitrary nonsingular matrix. Since the last system is formally a regularization (or generalization) of SLAE (71), we discuss below algorithms for solving Eq. (72).

Using the Schur complement

$$S = - C{{\tilde {D}}^{{ - 1}}}{{C}^{{\text{T}}}},$$

the matrix of system (72) is factorized as

$$\tilde {A} = \left[ {\begin{array}{*{20}{c}} {\tilde {D}}&0 \\ C&S \end{array}} \right]\left[ {\begin{array}{*{20}{c}} I&{{{{\tilde {D}}}^{{ - 1}}}{{C}^{{\text{T}}}}} \\ 0&I \end{array}} \right].$$

If the matrices \(\tilde {D}\) and \(S\) are replaced by their approximations (preconditioners) \({{B}_{d}}\) and \({{B}_{s}}\), then we obtain a preconditioner \(B\) for the matrix \(\tilde {A}\) in the form of an incomplete block factorization

$${{B}_{1}} = \left[ {\begin{array}{*{20}{c}} {{{B}_{d}}}&0 \\ C&{ - {{B}_{s}}} \end{array}} \right]\left[ {\begin{array}{*{20}{c}} I&{B_{d}^{{ - 1}}{{C}^{{\text{T}}}}} \\ 0&I \end{array}} \right].$$
(73)

A somewhat coarser approximation when only one (left or right) factor is used in (73) gives an incomplete block triangular preconditioning with the matrix

$${{B}_{2}} = \left[ {\begin{array}{*{20}{c}} {{{B}_{d}}}&{{{C}^{{\text{T}}}}} \\ 0&{ - {{B}_{s}}} \end{array}} \right]$$
(74)

or the incomplete Uzawa preconditioner

$${{B}_{3}} = \left[ {\begin{array}{*{20}{c}} {{{B}_{d}}}&0 \\ C&{ - {{B}_{s}}} \end{array}} \right].$$
(75)

The implementation of each step of the corresponding iterative processes can be divided into several stages at each of which only one block component of the desired solution is recalculated (for this reason, these methods are sometimes called segregated). In a slightly generalized form, we represent such a stationary algorithm (without Krylov acceleration) by the following three stages (see [83, 85, 86]):

$$\hat {u}_{d}^{{n + 1}} = u_{d}^{n} + Q_{d}^{{(1)}}({{f}_{d}} - \tilde {D}u_{d}^{n} - {{C}^{{\text{T}}}}u_{s}^{n}),$$
$$u_{s}^{{n + 1}} = u_{s}^{n} - B_{s}^{{ - 1}}({{f}_{s}} - C\hat {u}_{d}^{{n + 1}}),$$
(76)
$$u_{d}^{{n + 1}} = \hat {u}_{d}^{{n + 1}} + Q_{d}^{{(2)}}({{f}_{d}} - \tilde {D}\hat {u}_{d}^{{n + 1}} - {{C}^{{\text{T}}}}u_{s}^{{n + 1}}).$$

Here \(Q_{d}^{{(1)}}\) and \(Q_{d}^{{(2)}}\) are approximations of the inverse or generalized inverse of the preconditioner \({{B}_{d}}\). In particular, if \(Q_{d}^{{(1)}} = B_{d}^{{ - 1}}\), \(Q_{d}^{{(2)}} = 0\) or \(Q_{d}^{{(1)}} = 0\), \(Q_{d}^{{(2)}} = B_{d}^{{ - 1}}\), then we obtain from (76) either the Uzawa algorithm with the preconditioner \({{B}_{3}}\) defined in (75) (in this case, the third stage is omitted, i.e., \(u_{d}^{{n + 1}} = \hat {u}_{d}^{{n + 1}}\)) or the incomplete block triangular preconditioning with the matrix \({{B}_{2}}\) defined in (74) (in this case, the first stage in (76) is omitted, i.e., \(\hat {u}_{d}^{{n + 1}} = u_{d}^{n}\)).
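A minimal sketch of one step of the segregated scheme (76) is shown below (Python with NumPy), assuming for illustration that \(Q_{d}^{{(1)}} = Q_{d}^{{(2)}} = B_{d}^{{ - 1}}\); the callables `solve_Bd` and `solve_Bs` (which apply \(B_{d}^{{ - 1}}\) and \(B_{s}^{{ - 1}}\)) and the boolean flags selecting the Uzawa and block triangular special cases are illustrative assumptions.

```python
import numpy as np

def segregated_step(u_d, u_s, f_d, f_s, Dt, C, solve_Bd, solve_Bs,
                    use_pre=True, use_post=True):
    """One stationary step of scheme (76); Dt is the (1,1) block D-tilde.
    use_post=False gives the Uzawa variant (preconditioner B_3),
    use_pre=False the incomplete block triangular variant (B_2)."""
    u_hat = u_d
    if use_pre:
        u_hat = u_d + solve_Bd(f_d - Dt @ u_d - C.T @ u_s)   # first stage
    u_s_new = u_s - solve_Bs(f_s - C @ u_hat)                # second stage
    u_d_new = u_hat
    if use_post:
        u_d_new = u_hat + solve_Bd(f_d - Dt @ u_hat - C.T @ u_s_new)  # third stage
    return u_d_new, u_s_new
```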

If the matrix \({{Q}_{d}} = Q_{d}^{{(1)}} + Q_{d}^{{(2)}} - Q_{d}^{{(2)}}\tilde {D}Q_{d}^{{(1)}}\) is nonsingular, then the iterative process (76) is associated with the preconditioner

$${{B}_{4}} = \left( {\begin{array}{*{20}{c}} I&0 \\ {CQ_{d}^{{(1)}}}&I \end{array}} \right)\left( {\begin{array}{*{20}{c}} {Q_{d}^{{ - 1}}}&0 \\ 0&{ - {{B}_{s}}} \end{array}} \right)\left( {\begin{array}{*{20}{c}} I&{Q_{d}^{{(2)}}{{C}^{{\text{T}}}}} \\ 0&I \end{array}} \right).$$
(77)

In particular, for \(Q_{d}^{{(1)}} = Q_{d}^{{(2)}} = B_{d}^{{ - 1}}\), formula (77) yields the so-called symmetrized Uzawa method with the preconditioner

$${{B}_{5}} = \left( {\begin{array}{*{20}{c}} I&0 \\ {CB_{d}^{{ - 1}}}&I \end{array}} \right)\left( {\begin{array}{*{20}{c}} {{{B}_{d}}{{{(2{{B}_{d}} - \tilde {D})}}^{{ - 1}}}{{B}_{d}}}&0 \\ 0&{{{B}_{s}}} \end{array}} \right)\left( {\begin{array}{*{20}{c}} I&{B_{d}^{{ - 1}}{{C}^{{\text{T}}}}} \\ 0&I \end{array}} \right).$$

A promising block diagonal preconditioner for solving SLAE (72) is

$${{B}_{6}} = \left[ {\begin{array}{*{20}{c}} {\tilde {D} + {{C}^{{\text{T}}}}K_{1}^{{ - 1}}C}&0 \\ 0&{{{K}_{2}}} \end{array}} \right],$$
(78)

where \({{K}_{1}}\) and \({{K}_{2}}\) are symmetric nonsingular matrices.

The eigenvalues and eigenvectors of the preconditioned matrix \(\bar {A} = B_{6}^{{ - 1}}\tilde {A}\) of the “perturbed” SLAE (72) are determined by the eigenvalue problem

$$\lambda {{B}_{6}}z = \tilde {A}z,\quad z = {{(z_{1}^{{\text{T}}},z_{2}^{{\text{T}}})}^{{\text{T}}}},\quad {{z}_{k}} \in {{\mathcal{R}}^{{{{N}_{k}}}}},\quad k = 1,\;2,$$

which in componentwise form is written as

$$\begin{gathered} (D + {{C}^{{\text{T}}}}K_{0}^{{ - 1}}C){{z}_{1}} + {{C}^{{\text{T}}}}{{z}_{2}} = \lambda (\tilde {D} + {{C}^{{\text{T}}}}K_{1}^{{ - 1}}C){{z}_{1}}, \\ C{{z}_{1}} = \lambda {{K}_{2}}{{z}_{2}}. \\ \end{gathered} $$
(79)

The analysis of this eigenvalue problem yields interesting results for various special cases. In particular, it was shown in [81, 82] that unique spectral properties of the matrix \(\bar {A}\) are obtained if the block \(\tilde {D}\) is strongly singular, which is important for algorithms designed for solving systems of Maxwell equations. For example, we have the following results.

Proposition 1. Let \({{B}_{6}}\) be an spd matrix and \({\text{\{}}{{z}_{i}},\;i = 1,\;2,\; \ldots ,\;{{N}_{1}} - {{N}_{2}}{\text{\}}}\) be a basis of the kernel of the matrix \(C\). Then, the preconditioned matrix \(B_{6}^{{ - 1}}\tilde {A}\) has \({{N}_{1}} - {{N}_{2}}\) linearly independent eigenvectors \({{(z_{i}^{{\text{T}}},0)}^{{\text{T}}}} \in {{\mathcal{R}}^{N}}\) associated with the eigenvalue \(\lambda = 1\) of multiplicity at least \({{N}_{1}} - {{N}_{2}}\).

Proposition 2. Let \({{K}_{1}} = {{K}_{2}} = K\), and let \(\tilde {D}\) be a symmetric positive semidefinite matrix with the kernel dimension equal to \(r\). Then, \(\lambda = 1\) is an eigenvalue of the matrix \(B_{6}^{{ - 1}}\tilde {A}\) with the multiplicity \({{N}_{1}}\), \(\lambda = - 1\) is an eigenvalue of multiplicity \(r\), and the other \({{N}_{2}} - r\) eigenvalues lie in the interval \(\lambda \in ( - 1,\;0)\) and satisfy the relation

$$\lambda = - \mu {\text{/}}(\mu + 1),$$

where \(\mu \) are \({{N}_{2}} - r\) positive generalized eigenvalues

$$\mu \tilde {D}z = {{C}^{{\text{T}}}}KCz.$$

Let \({\text{\{}}{{z}_{i}},\;i = 1,\;2,\; \ldots ,\;{{N}_{1}} - {{N}_{2}}{\text{\}}}\) be a basis of the kernel of \(C\), \({\text{\{}}{{x}_{i}},\;i = 1,\;2,\; \ldots ,\;r{\text{\}}}\) be a basis of the kernel of \(\tilde {D}\), and \({\text{\{}}{{y}_{i}},\;i = 1,\;2,\; \ldots ,\;{{N}_{2}} - r{\text{\}}}\) be a set of linearly independent vectors complementing \(\ker (\tilde {D}) \cup \ker (C)\) to a basis of \({{\mathcal{R}}^{{{{N}_{1}}}}}\). Then, the \({{N}_{1}} - {{N}_{2}}\) vectors \(({{z}_{i}},0)\), the \(r\) vectors (\({{x}_{i}}\), \({{K}^{{ - 1}}}C{{x}_{i}}\)), and the \({{N}_{2}} - r\) vectors (\({{y}_{i}}\), \({{K}^{{ - 1}}}C{{y}_{i}}\)) are linearly independent eigenvectors corresponding to the eigenvalue \(\lambda = 1\), and the \(r\) vectors (\({{x}_{i}}\), \( - {{K}^{{ - 1}}}C{{x}_{i}}\)) are eigenvectors corresponding to the eigenvalue \(\lambda = - 1\).

On the whole, the presence of different matrices \({{K}_{0}}\), \({{K}_{1}}\), and \({{K}_{2}}\) in the preconditioner \({{B}_{6}}\) provides wide opportunities for constructing efficient algorithms in particular cases.

We consider a family of iterative methods for solving saddle symmetric SLAEs with the matrix \(\tilde {A}\) defined by (72); this family is based on the efficient Golub–Kahan \(G - K\) bidiagonalization, which was originally proposed for the singular value decomposition of rectangular matrices but was later successfully used for solving algebraic systems, including those of block saddle structure, in the works by Saunders, Arioli, Greif, and other researchers.

Without loss of generality, we write the SLAE under examination in the form

$$\tilde {A}\left[ {\begin{array}{*{20}{c}} u \\ p \end{array}} \right] \equiv \left[ {\begin{array}{*{20}{c}} {\tilde {D}}&{{{C}^{{\text{T}}}}} \\ C&0 \end{array}} \right]\left[ {\begin{array}{*{20}{c}} u \\ p \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} 0 \\ g \end{array}} \right],\quad \tilde {D} = D + {{C}^{{\text{T}}}}{{K}^{{ - 1}}}C.$$
(80)

It is easy to verify that, if the function \(u\) in (72) is replaced by \(u + {{\tilde {D}}^{{ - 1}}}f\), then this system takes form (80) with the right-hand side \(g = - C{{\tilde {D}}^{{ - 1}}}f\). It is assumed that \(\tilde {D}\) and \(K\) in (80) are spd matrices and \({{N}_{1}} \geqslant {{N}_{2}}\).

The \(G - K\)-bidiagonalization method is based on the construction of \(\tilde {D}\)-orthogonal vectors \({{{v}}_{k}}\) and \(K\)-orthogonal vectors \({{q}_{k}}\) that satisfy the conditions

$$\begin{gathered} {{C}^{{\text{T}}}}Q = \tilde {D}V{{[{{B}^{{\text{T}}}}0]}^{{\text{T}}}},\quad {{V}^{{\text{T}}}}\tilde {D}V = {{I}_{{{{N}_{1}}}}}, \\ CV = KQ[{{B}^{{\text{T}}}}0],\quad {{Q}^{{\text{T}}}}KQ = {{I}_{{{{N}_{2}}}}}, \\ \end{gathered} $$
(81)

where \(V = [{{{v}}_{1}},\; \ldots ,\;{{{v}}_{{{{N}_{1}}}}}] \in {{\mathcal{R}}^{{{{N}_{1}},{{N}_{1}}}}}\), \(Q = [{{q}_{1}},\; \ldots ,\;{{q}_{{{{N}_{2}}}}}] \in {{\mathcal{R}}^{{{{N}_{2}},{{N}_{2}}}}}\), and \(B \in {{\mathcal{R}}^{{{{N}_{2}},{{N}_{2}}}}}\) is a bidiagonal matrix

$$B = \left[ {\begin{array}{*{20}{c}} {{{\alpha }_{1}}}&{{{\beta }_{2}}}&0& \ldots &0 \\ 0&{{{\alpha }_{2}}}&{{{\beta }_{3}}}& \ddots &0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0& \ldots &0&{{{\alpha }_{{{{N}_{2}} - 1}}}}&{{{\beta }_{{{{N}_{2}}}}}} \\ 0& \ldots &0&0&{{{\alpha }_{{{{N}_{2}}}}}} \end{array}} \right].$$

By introducing new unknown functions

$$u = Vz,\quad p = Qy$$
(82)

and multiplying system (80) by the block diagonal matrix block-diag(\({{V}^{{\text{T}}}},{{Q}^{{\text{T}}}}\)), we obtain

$$\left[ {\begin{array}{*{20}{c}} {{{I}_{{{{N}_{2}}}}}}&0&B \\ 0&{{{I}_{{{{N}_{1}} - {{N}_{2}}}}}}&0 \\ {{{B}^{ \top }}}&0&0 \end{array}} \right]\left[ {\begin{array}{*{20}{c}} {{{z}_{1}}} \\ {{{z}_{2}}} \\ y \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} 0 \\ 0 \\ {{{Q}^{ \top }}g} \end{array}} \right],\quad {{z}_{1}} \in {{\mathcal{R}}^{{{{N}_{2}}}}},\quad {{z}_{2}} \in {{\mathcal{R}}^{{{{N}_{1}} - {{N}_{2}}}}}.$$
(83)

It follows from (83) that the vector \(u\) depends only on the first \({{N}_{2}}\) columns of the matrix \(V\) because \({{z}_{2}} = 0\). Thus, SLAE (83) is reduced to the form

$$\left[ {\begin{array}{*{20}{c}} {{{I}_{{{{N}_{2}}}}}}&B \\ {{{B}^{ \top }}}&0 \end{array}} \right]\left[ {\begin{array}{*{20}{c}} {{{z}_{1}}} \\ y \end{array}} \right] = \left[ {\begin{array}{*{20}{c}} 0 \\ {{{Q}^{{\text{T}}}}g} \end{array}} \right].$$
(84)

Hence, we put

$${{q}_{1}} = {{K}^{{ - 1}}}g{\text{/}}{{\left\| g \right\|}_{{{{K}^{{ - 1}}}}}},\quad {{\alpha }_{1}}\tilde {D}{{{v}}_{1}} = {{C}^{{\text{T}}}}{{q}_{1}},\quad {{\left\| g \right\|}_{{{{K}^{{ - 1}}}}}} = {{(g,{{K}^{{ - 1}}}g)}^{{1/2}}}$$

to obtain the initial vector \({{{v}}_{1}}\):

$${{{v}}_{1}} = w{\text{/}}{{\alpha }_{1}},\quad {{\alpha }_{1}} = \sqrt {{{w}^{{\text{T}}}}{{C}^{{\text{T}}}}{{q}_{1}}} ,\quad w = {{\tilde {D}}^{{ - 1}}}{{C}^{{\text{T}}}}{{q}_{1}}.$$
(85)

The further vectors \({{{v}}_{n}}\), \({{q}_{n}}\) and the elements \({{\alpha }_{n}}\), \({{\beta }_{n}}\) of the matrix \(B\) are calculated from the recurrences (\(n = 1,\;2,\; \ldots \))

$$s = {{K}^{{ - 1}}}(C{{{v}}_{n}} - {{\alpha }_{n}}K{{q}_{n}}),\quad {{\beta }_{{n + 1}}} = \sqrt {{{s}^{{\text{T}}}}Ks} ,$$
$$\begin{gathered} {{q}_{{n + 1}}} = s{\text{/}}{{\beta }_{{n + 1}}},\quad w = {{{\tilde {D}}}^{{ - 1}}}({{C}^{{\text{T}}}}{{q}_{{n + 1}}} - {{\beta }_{{n + 1}}}\tilde {D}{{{v}}_{n}}), \\ {{\alpha }_{{n + 1}}} = {{({{w}^{{\text{T}}}}\tilde {D}w)}^{{1/2}}},\quad {{{v}}_{{n + 1}}} = w{\text{/}}{{\alpha }_{{n + 1}}}. \\ \end{gathered} $$
(86)

The successive approximations \({{u}^{n}}\), according to (82), are found from the first \(n\) columns of the matrix V, i.e.,

$${{u}^{n}} = {{V}_{n}}{{z}^{n}} = \sum\limits_{j = 1}^n {z_{1}^{j}} {{{v}}_{j}},\quad {{z}^{n}} = (z_{1}^{1},\; \ldots ,\;z_{1}^{n}),$$
(87)

where \(z_{1}^{j}\) are the components of the vector \({{z}_{1}}\) in (84) calculated by the formulas

$$z_{1}^{1} = {{\left\| g \right\|}_{{{{K}^{{ - 1}}}}}}{\text{/}}{{\alpha }_{1}},\quad z_{1}^{{j + 1}} = - {{\beta }_{{j + 1}}}z_{1}^{j}{\text{/}}{{\alpha }_{{j + 1}}},\quad j = 1,\;2,\; \ldots ,\;{{N}_{2}} - 1.$$
(88)

We omit the details of derivation of these formulas (see [87]) and only present the final recurrences for the iterative solution

$$\begin{gathered} {{u}^{{n + 1}}} = {{u}^{n}} + z_{1}^{{n + 1}}{{{v}}_{{n + 1}}},\quad n = 1,\;2,\; \ldots ,\;{{N}_{2}} - 1, \\ {{p}^{{n + 1}}} = {{p}^{n}} - z_{1}^{{n + 1}}{{d}_{{n + 1}}},\quad {{d}_{{n + 1}}} = ({{q}_{{n + 1}}} - {{\beta }_{{n + 1}}}{{d}_{n}}){\text{/}}{{\alpha }_{{n + 1}}}, \\ \end{gathered} $$
(89)

where \({{d}_{n}}\) is the \(n\)th column of the matrix \(D = Q{{B}^{{ - 1}}}\) (with \({{d}_{1}} = {{q}_{1}}{\text{/}}{{\alpha }_{1}}\)). Note that the algorithm just described has the following optimality property: at the \(n\)th step, the error of the iterative approximation attains the minimum

$$\begin{gathered} \mathop {\min}\limits_{\begin{array}{*{20}{c}} {{{u}^{n}} \in {{\mathcal{U}}_{n}}} \\ {C{{u}^{n}} - g \bot {{\mathcal{Q}}_{n}}} \end{array}} {{\left\| {u - {{u}^{n}}} \right\|}_{{\tilde {D}}}}, \\ {{\mathcal{U}}_{n}} = {\text{Span\{ }}{{{v}}_{1}},\; \ldots ,\;{{{v}}_{n}}{\text{\}}},\quad {{\mathcal{Q}}_{n}} = {\text{Span\{ }}{{q}_{1}},\; \ldots ,\;{{q}_{n}}{\text{\}}}. \\ \end{gathered} $$
(90)

As is mentioned in [71], this method has high convergence rate in the case of SLAEs arising in finite element approximations of continuous multidimensional saddle problems, i.e., in mixed statements. An important factor is that at each iteration algebraic subsystems with the matrices \(\tilde {D}\) and \(K\) must be solved; in a certain sense, these matrices play the role of preconditioners. Their approximate inversion actually leads to two-level iterative processes in certain subspaces.
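To make the recurrences (85)–(89) concrete, a dense NumPy sketch is given below; the inner systems with \(\tilde D\) and \(K\) are solved directly here, whereas in practice they would be handled by the preconditioners just discussed, and the stopping rule is an illustrative choice rather than a prescribed one.

```python
import numpy as np

def golub_kahan_saddle(D_t, K, C, g, tol=1e-10, maxit=None):
    """Sketch of the generalized Golub-Kahan iteration (85)-(89) for the
    saddle system (80); D_t (= D-tilde) and K are assumed SPD dense arrays."""
    n2 = C.shape[0]
    maxit = maxit or n2
    g_norm = np.sqrt(g @ np.linalg.solve(K, g))        # ||g||_{K^{-1}}
    q = np.linalg.solve(K, g) / g_norm                 # q_1
    w = np.linalg.solve(D_t, C.T @ q)                  # w = D_t^{-1} C^T q_1  (85)
    alpha = np.sqrt(w @ (C.T @ q))                     # alpha_1
    v = w / alpha
    z = g_norm / alpha                                 # z_1^1 from (88)
    d = q / alpha                                      # first column of Q B^{-1}
    u = z * v
    p = -z * d
    for _ in range(1, maxit):
        s = np.linalg.solve(K, C @ v) - alpha * q      # (86)
        beta = np.sqrt(s @ (K @ s))
        q = s / beta
        w = np.linalg.solve(D_t, C.T @ q) - beta * v
        alpha = np.sqrt(w @ (D_t @ w))
        v = w / alpha
        z = -beta * z / alpha                          # (88)
        d = (q - beta * d) / alpha                     # (89)
        u = u + z * v
        p = p - z * d
        if abs(z) < tol * g_norm:
            break
    return u, p
```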

4 PARALLELIZATION AND PERFORMANCE ISSUES OF ITERATIVE METHODS

The modern understanding of algorithm quality includes two main characteristics—mathematical efficiency and performance of implementation on a particular supercomputer architecture. The first characteristic includes the construction and optimization of iterative methods with a high convergence rate and theoretical estimates of the required computational resources (the number of arithmetic operations and amount of memory). The second characteristic is purely practical—it is determined by the actual execution time of the algorithm for a certain class of problems, which depends on the scalability of the algorithm parallelization and on the programming technology on a specific computer platform.

The SLAEs of greatest interest for us have large size (\({{10}^{8}}\)–\({{10}^{{11}}}\)) and sparse matrices with large condition numbers (\({{10}^{{12}}}\)–\({{10}^{{15}}}\)) and an irregular structure. This not only leads to a large number of iterations but also forces one to work with systems with distributed and hierarchical shared memory, which significantly slows down data access.

The main quantitative characteristic of parallelization is the acceleration of computations

$${{S}_{p}} = {{T}_{1}}{\text{/}}{{T}_{p}},\quad {{T}_{p}} = T_{p}^{a} + T_{p}^{c},$$

where \({{T}_{p}}\) is the time of solving the problem on \(p\) processors, which is the sum of data exchange time and arithmetic operations execution time; in turn, these times are found by the approximate formulas

$${{T}^{a}} = {{\tau }_{a}}{{N}_{a}},\quad {{T}^{c}} = {{N}_{t}}({{\tau }_{0}} + {{\tau }_{c}}{{N}_{c}}).$$

Here \({{\tau }_{a}}\) and \({{N}_{a}}\) are the average time of an arithmetic operation execution and their total number, respectively, \({{N}_{t}}\) is the number of data exchanges, \({{\tau }_{0}}\) and \({{\tau }_{c}}\) are the waiting time and the time of sending one number, respectively, and \({{N}_{c}}\) is the average size of one communication.

Machine constants satisfy the inequalities \({{\tau }_{0}} \gg {{\tau }_{c}} \gg {{\tau }_{a}}\); therefore, for the algorithms to be constructed, we have obvious recommendations: try to minimize the number of communications and make data exchanges in large portions, i.e., pre-accumulate data buffers. These conclusions are all the more important because interprocessor data transfers not only slow down the computational process but also are the most energy-consuming operations, and this becomes a significant factor in the cost of operating a supercomputer [93].
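The effect of these inequalities is easy to see on the communication part of the model: for the same total amount of transferred data, many short messages are dominated by the latency \({{\tau }_{0}}\), while a few long ones are dominated by the bandwidth term, as in the following toy computation (the constants are illustrative, not measured).

```python
def comm_time(n_exchanges, n_values, tau_0=1.0e-6, tau_c=1.0e-9):
    """Communication time T^c = N_t (tau_0 + tau_c N_c) from the model above."""
    return n_exchanges * (tau_0 + tau_c * n_values)

# the same 10^6 numbers sent as many small or few large messages:
many_small = comm_time(10_000, 100)     # ~1.1e-2 s, latency-dominated
few_large = comm_time(10, 100_000)      # ~1.0e-3 s, bandwidth-dominated
```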

The strategy of parallelizing large grid SLAEs arising from the approximation of multidimensional boundary value problems is based on hybrid programming and additive domain decomposition methods with two-level iterations in Krylov subspaces. The upper-level iterations (over subdomains) are performed using the message passing interface (MPI) between processes, each of which solves (simultaneously) an algebraic subsystem in the corresponding subdomain. At each iteration, the values of the approximate solutions on the interfaces between subdomains are exchanged. Naturally, all matrix and vector data for the subsystems are distributed between the processes in advance. The solution of the SLAE in each subdomain is parallelized using multithreaded computations (like OpenMP) on multicore processors with shared memory. Additional acceleration can be achieved by vectorizing operations with AVX-like instruction sets based on SIMD (Single Instruction Multiple Data) technologies (see the reviews in [18]). Unfortunately, there are presently no ready-to-use software development systems with automatic parallelization of algorithms; hence, the success in speeding up computations and in scalability significantly depends on the skills of the programmer.

Note that the traditional block Jacobi–Schwarz methods underlying parallel domain decomposition algorithms are based on the spectral optimization of iterative processes. A possible alternative to them is provided by the projection block Cimmino methods [9, 94], which are still insufficiently studied and are rarely used in practice.

From the parallelization viewpoint, implicit alternating direction or Peaceman–Rachford methods (see [90] and the review in [10]), which were intensively developed in the 1960s–1970s, seem promising. They have a record convergence rate for model grid boundary value problems. In recent decades, this direction has found a second wind in connection with the development of rational approximations in Krylov subspaces (see [95]), including the solution of the Lyapunov and Sylvester matrix equations. Note that in [93] a method for increasing the degree of parallelism based on expanding rational functions into partial fractions was proposed.

Computations can be significantly accelerated by using variable (mixed) precision arithmetic. Traditionally, the solution of large SLAEs uses double precision arithmetic, in which floating point numbers are represented by 64 bits; however, for extremely poorly conditioned matrices, quadruple precision (128 bits) is required. By contrast, at certain stages of the algorithm, single and even half precision (32 and 16 bits, respectively), which are significantly faster, may be used. This approach is quite natural at the first iterations, when the error of the approximate solution is relatively high. Another option is to use low precision in DDMs when solving auxiliary problems in subdomains. It should be taken into account that such tricks require a thorough analysis of the numerical errors of the method to ensure the stability of computations as a whole.
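As an illustration of the mixed precision idea, the following sketch performs iterative refinement in which the (expensive) approximate solve is done in single precision while residuals and corrections are accumulated in double precision; the dense direct solve stands in for whatever inner solver is actually used and is repeated each pass only for brevity.

```python
import numpy as np

def mixed_precision_refinement(A, f, n_iter=5):
    """Iterative refinement sketch: low-precision inner solve,
    double-precision residuals and solution updates."""
    A32 = A.astype(np.float32)
    u = np.zeros_like(f)
    for _ in range(n_iter):
        r = f - A @ u                                      # residual in float64
        du = np.linalg.solve(A32, r.astype(np.float32))    # single-precision solve
        u = u + du.astype(np.float64)                      # correction in float64
    return u
```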

A further reserve for improving performance is code optimization, which can be achieved by using high-quality computational tools (e.g., Sparse BLAS), various compiler options, and special properties of a particular supercomputer platform. In recent decades, a large number of high-quality software libraries for computational algebra have been developed (PETSc, HYPRE, Intel MKL, and others; see the reviews in [22, 23]). Also noteworthy are the works performed at the All-Russian Research Institute of Experimental Physics, Russian Federal Research Center (Sarov) [96] and the projects on integrated computational environments DUNE [97], INMOST [98], and BSM [1, 24].

The creation, maintenance, and development of mathematical methods and software for the efficient solution of SLAEs is a continuous science-intensive technological process, which includes regular experimental research that requires intelligent means of support, such as systems for automation of verifying and testing algorithms and their implementations, collections of problems (such as popular Sparse Matrix Collections offered by various developers) and numerous demo and training versions of software (in addition to traditional monographs, textbooks, and user guides with recommendations on the application of various algorithms for solving various practical problems).

5 CONCLUSIONS: PROBLEMS AND PROSPECTS OF DEVELOPING ITERATIVE PROCESSES

From the systems point of view, the high-performance solution of SLAEs is a widely demanded area of intellectual activity that is far from being limited to writing and justifying the formulas of algorithms; it also includes the statement of new practical problems or theoretical ideas, technological aspects of software implementations, and issues of their effective application in supercomputer experiments.

New practical challenges of mathematical modeling will drastically increase the demand for solving interdisciplinary and inverse high-resolution problems with real-life data, which makes the high-performance solution of SLAEs with tens or hundreds of billions of degrees of freedom and extremely poor conditioning especially urgent. In recent decades, computational algebra methods have been rapidly developing both theoretically and experimentally, and promising directions have formed here: low-rank approximations of matrices, including rational approximations, tensor methodologies, graph-theoretic and combinatorial approaches, and aggregation and segregation methods; moreover, novel ideas in traditional matrix factorizations, algebraic decompositions, multigrid technologies, etc. are appearing.

Another important factor is the ongoing accumulation of a vast number of various problems (interdisciplinary, methodological, practical, typical, and unique) and of methods and technologies for their solution on computers of various classes. At the same time, there is a transition from the quantity of processed data to quality, which requires a concept for developing a new generation of mathematical techniques and software. An important task is to create an active base of mathematical knowledge to ensure the automation or optimization of algorithm construction and of their mapping to supercomputer architectures. A prototype of such a base is the ALGOWIKI project developed in the framework of Wikipedia technologies under the guidance of J. Dongarra and Vl.V. Voevodin. This task is within the realm of artificial intelligence and machine learning, which, along with big data technologies, form the foundation of modern modeling. The achieved scales of the triad problems–methods–computers lead to the fact that the integrated computing environment for solving SLAEs should become the form and content of a hierarchical infrastructure that constitutes an industrial-type instrumental environment and implements further stages of the development of high-tech computing and information technologies, the outlines of which are presented in [1, 24, 93]. To prevent deep specialization from separating computational theorists, programmers, and end users, their joint efforts should be aimed at creating an intelligent ecosystem that supports interprofessional and human-machine interfaces, both to intensify fundamental research and to quickly implement its innovations.