1 Introduction

In general, modern domain decomposition methods (DDMs) for solving very large systems of linear algebraic equations (SLAEs), which arise from discrete approximations of multi-dimensional boundary value problems (BVPs) on unstructured meshes by the finite element or other grid methods, combine three main mathematical ingredients: an outer Krylov-type iterative process “on subdomains”, which realizes the additive Schwarz (or a special block Jacobi) algorithm; the simultaneous solution of the auxiliary BVPs in the subdomains, which can be carried out by a direct or an iterative algorithm; and preconditioning procedures that accelerate the outer iterations, see [15] and the literature cited there.

The last ingredient is very important for strongly scalable parallel computations, because for a very large number of subdomains, and hence a large block number of degrees of freedom, a considerable stagnation of the iterative process can be observed. In recent decades, various versions of aggregation, deflation, and coarse grid correction accelerators have been investigated and applied successfully by many authors. The main goal of our paper consists in the numerical analysis of several versions of aggregation acceleration based on low-rank matrix approximations in different coarse subspaces. The program implementation of the algorithms is realized for the universal compressed sparse matrix format, which is necessary for solving practical problems.

The conventional parallel technologies of DDMs include two levels: the application of MPI processes to the corresponding subdomains, including interface communications between them at each outer (external) iteration, and multi-threaded computing for the “internal” parallelization on multi-core processors.

The problems are characterized by three levels of degrees of freedom: the number of unknowns of the SLAE \((10^{8}-10^{11})\), the number of subdomains (\(10^{2}-10^{5}\), the block dimension of the problem), and the coarse grid dimension \((10-10^{3})\), which together determine the scalability of the parallelism of the overall computational process.

The bottleneck of the DDM approach is the minimization of communication time, which can be achieved by overlapping data transfer with synchronized computations in the subdomains. In general, DDM performance depends on the convergence properties of the iterative algorithms and on the efficiency of the computational technologies, whose variants are discussed later on. In Sect. 2, we describe the algebraic and structural representation of the multi-level preconditioned iterative processes in Krylov subspaces. Section 3 is devoted to the mapping of the parallel algorithms under consideration onto a computational multi-processor system with distributed and shared memory architecture. The results of numerical experiments and an analysis of the various approaches are given for different orders of the basic interpolation functions and for different placements of the coarse grid nodes. The efficiency of the proposed algorithms is demonstrated on a representative set of model examples.

2 Statement of the Problem and Algorithms

Let us consider a SLAE

$$\begin{aligned} \begin{array}{l}\displaystyle Au=\sum \limits _{l^{\prime }\in \omega _{l}}a_{l,l^\prime } u_{l^\prime }=f, A=\{a_{l,l^{\prime }}\}\in \mathcal {R}^{N,N}, u=\{u_{l}\}, f=\{f_{l}\}\in \mathcal {R}^{N}, \end{array} \end{aligned}$$
(1)

with a large sparse matrix with real entries arising from some discrete approximation of a multi-dimensional BVP by the finite element, the finite volume, or another grid method; \(\omega _{l}\) denotes the set of the indices of the off-diagonal entries in the l-th row of the matrix A.

We can divide the total set of the vector indices \(\varOmega =\{l\}\) into P non-intersecting subsets, or algebraic subdomains,

$$\begin{aligned} \begin{array}{l}\displaystyle \varOmega =\bigcup \limits ^{P}_{s=1}\varOmega _{s}, N=\sum \limits ^{P}_{s=1}N_{s}, \end{array} \end{aligned}$$
(2)

each containing an approximately equal number of elements \(N_{s}\). For the subdomains \(\varOmega _{s}\), let us denote their boundaries \(\varGamma _{s}^0\) and closures as follows:

$$\begin{aligned} \begin{array}{l}\displaystyle \varGamma _{s}\equiv \varGamma _{s}^0=\{l^{\prime }\in \omega _{l}, l\in \varOmega _{s}, l^{\prime }\notin \varOmega _{s}\}, \bar{\varOmega }_s^0 = \varOmega _s\bigcup \varGamma _s^0. \end{array} \end{aligned}$$
(3)

Also, we can define the boundary layers of \(\varOmega _{s}\):

$$\begin{aligned} \begin{array}{l}\displaystyle \varGamma ^{t}_{s}=\Big \{l^{\prime }\in \omega _{l}, l\in \bar{\varOmega }^{t-1}_{s}, \bar{\varOmega }^{t}_{s}=\bar{\varOmega }^{t-1}_{s}\bigcup \varGamma ^{t}_{s}, t=1,2,...,\varDelta _{s}\Big \}. \end{array} \end{aligned}$$
(4)

Parameter \(\varDelta _{s}\) presents the measure of the extension of the subdomain \(\varOmega _{s}\). The sets \(\bar{\varOmega }^{\varDelta _{s}}_{s}\) form an algebraic decomposition of the original domain \(\varOmega \) into subdomains with parametrized overlap. Historically, it is known that increasing the overlap improves the convergence of the DDM iterations but also increases the cost of each iteration. For the subvectors

$$ \bar{u}_{s}=\{u_{l}, l\in \bar{\varOmega }^{\varDelta _{s}}_{s}\}\in \mathcal {R}^{\bar{N}_{s}}, u=\bigcup \limits ^{P}_{s=1}\bar{u}_{s}, $$

the original system can be written in a block form

$$\begin{aligned} \begin{array}{l}\displaystyle A_{s,s}\bar{u}_{s}+\sum \limits _{s^{\prime }\in Q_{s}}A_{s,s^{\prime }}\bar{u}_{s^{\prime }}=f_{s}, s=1,...,P, \end{array} \end{aligned}$$
(5)

where \(Q_{s}\) is the set of subdomains which are adjacent to the extended subdomain \(\bar{\varOmega }^{\varDelta _{s}}_{s}\).
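The construction of the extended index sets \(\bar{\varOmega }^{\varDelta _{s}}_{s}\) in (3)–(4) amounts to a breadth-first traversal of the adjacency graph of A, one layer per unit of the overlap parameter. Below is a minimal Python/SciPy sketch of this layer-by-layer extension; the function name and data layout are assumptions for illustration, not the library code.

```python
import numpy as np
from scipy.sparse import csr_matrix

def extend_subdomain(A: csr_matrix, omega_s, delta_s):
    """Grow the index set omega_s by delta_s boundary layers (Eqs. (3)-(4)),
    following the sparsity pattern omega_l of the rows of A."""
    closure = set(int(l) for l in omega_s)
    front = set(closure)
    for _ in range(delta_s):
        layer = set()
        for l in front:                              # scan the rows of the current front
            row = A.indices[A.indptr[l]:A.indptr[l + 1]]
            layer.update(int(lp) for lp in row if lp not in closure)
        if not layer:                                # nothing left to add
            break
        closure |= layer                             # bar{Omega}^t_s = bar{Omega}^{t-1}_s U Gamma^t_s
        front = layer
    return np.array(sorted(closure))
```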

To solve (5), the generalized block Jacobi iterative process is used:

$$\begin{aligned} \begin{array}{l}\displaystyle \bar{B}_{s}(\bar{u}^{n+1}_{s}-\bar{u}^{n}_{s})=\bar{f}_{s}-(\bar{A}\bar{u}^{n})_{s}\equiv \bar{r}^{n}_{s}, \bar{u}^{n}_{s}\in \mathcal {R}^{\bar{N}_{s}}. \end{array} \end{aligned}$$
(6)

Here \(\bar{r}^{n}_{s}\) is the residual subvector and \(\bar{B}_{s}\) is some preconditioning matrix which takes into account the permutations of the “boundary” rows \(l\in \varGamma ^{\varDelta _{s}}_{s}\), because of using special interface conditions of Steklov-Poincaré type between the neighbouring subdomains in the Schwarz iterations, see [24] for details.

The vector \(u^{n}\) approximating the solution of the original SLAE (1) is not defined uniquely in (6), because in the intersections of the neighbouring subdomains \(\bar{\varOmega }^{\varDelta _{s}}_{s}\) we have several values of the vector components for different s. In order to avoid this ambiguity, different approaches are used. We apply the restricted additive Schwarz (RAS) algorithm, which is based on using the restriction operators \(R_{s}\in \mathcal {R}^{N_{s},\bar{N}_{s}}\):

$$\begin{aligned} \begin{array}{l}\displaystyle u^{n}_{s}=R_{s}\bar{u}^{n}_{s}=\{u^{n}_{l}=(R_{s}\bar{u}^{n}_{s})_{l}, l\in \varOmega _{s}\}\in \mathcal {R}^{N_{s}}, \end{array} \end{aligned}$$
(7)

where the subdomains \(\varOmega _{s}, s=1,...,P\), define the domain decomposition without overlapping.

The RAS Jacobi type method can be written in the following form:

$$\begin{aligned} \begin{array}{l}\displaystyle u^{n+1}=u^{n}+B^{-1}_{ras}r^{n}, \\ B^{-1}_{ras}=R\hat{A}^{-1}W^{T}, \hat{A}=W^{T}\,A\,W= \text{ block-diag } \{A_{s,s}\in \mathcal {R}^{\bar{N}_{s},\bar{N}_{s}}\}, \end{array} \end{aligned}$$
(8)

\(W=[w_{1}...w_{P}]\in \mathcal {R}^{N,P}\) is a rectangular matrix; each of its columns \(w_{s}\) has entries equal to one at the nodes from \(\bar{\varOmega }_{s}\) and zero entries otherwise. Let us note that, in general, even if the original SLAE is symmetric, the preconditioning matrix \(B_{ras}\) from (8) is not symmetric. In addition, the inversion of the blocks \(A_{s,s}\) of the matrix \(\hat{A}\) reduces to the simultaneous solution of independent subsystems in the corresponding subdomains.
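A minimal sequential sketch of one application of \(B^{-1}_{ras}\) in the spirit of (8): the local subsystems on the extended subdomains are solved independently, and only the values belonging to the non-overlapping \(\varOmega _{s}\) are kept, as prescribed by the restriction operators \(R_{s}\) from (7). The function names and the use of SciPy's LU factorization are assumptions; in the parallel code the local solves run simultaneously in the MPI processes.

```python
import numpy as np
from scipy.sparse.linalg import splu

def make_ras_preconditioner(A, extended, interior):
    """extended[s]: node indices of the extended subdomain bar{Omega}_s^{Delta_s};
    interior[s]: indices of the non-overlapping Omega_s.
    Returns a callable applying B_ras^{-1} from (8) to a residual vector."""
    A = A.tocsr()
    ext_idx = [np.asarray(e) for e in extended]
    local_lu = [splu(A[e][:, e].tocsc()) for e in ext_idx]   # factor each A_{s,s} once
    def apply(r):
        u = np.zeros_like(r, dtype=float)
        for s, e in enumerate(ext_idx):
            v = local_lu[s].solve(r[e])            # local solve on the extended subdomain
            keep = np.isin(e, interior[s])         # restriction R_s: keep only Omega_s values
            u[e[keep]] = v[keep]
        return u
    return apply
```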

We suppose that SLAE (1) is obtained from the approximation of a multi-dimensional BVP for partial differential equations by the finite element, finite volume or other method on some non-structured grid. For example, let the Dirichlet problem for the diffusion-convection equation

$$\begin{aligned} \begin{array}{l}\displaystyle - \frac{\partial ^{2}u}{\partial x^{2}} - \frac{\partial ^{2}u}{\partial y^{2}}+p\frac{\partial u}{\partial x}+q\frac{\partial u}{\partial y}=f(x,y), (x,y)\in \varOmega , \\ u|_{\varGamma }=g(x,y), \end{array} \end{aligned}$$
(9)

be solved in a computational domain \(\varOmega =(a_{x},b_{x})\times (a_{y},b_{y})\), where \(\varGamma \) is the boundary of \(\varOmega \), and the convection coefficients \(p, q\) are, for simplicity, given constants. For the sake of brevity, we will use the symbol \(\varOmega \) to denote either the computational domain or the grid domain, according to the context.

The given boundary value problem is approximated on a uniform grid

$$\begin{aligned} \begin{array}{l}\displaystyle x_{i}=a_{x}+ih_{x}, y_{j}=a_{y}+jh_{y}, \\ i=0,1,...,N_{x}+1; j=0,1,...,N_{y}+1; \\ h_{x}=(b_{x}-a_{x})/(N_{x}+1), h_{y}=(b_{y}-a_{y})/(N_{y}+1), \end{array} \end{aligned}$$
(10)

by a five-point scheme of the form

$$\begin{aligned} \begin{array}{l}\displaystyle (Au)_{l}=a_{l,l}u_{l}+a_{l,l-1}u_{l-1}+a_{l,l+1}u_{l+1}+a_{l,l-N_{x}}u_{l-N_{x}}+a_{l,l+N_{x}}u_{l+N_{x}}=f_{l}, \end{array} \end{aligned}$$
(11)

where l is a “global”, or natural, number of an inner grid node:

$$\begin{aligned} \begin{array}{l}\displaystyle l=l(i,j)\equiv i+(j-1)N_{x}=1,...,N=N_{x}N_{y}. \end{array} \end{aligned}$$
(12)

The particular form of the coefficients \(a_{l,l^{\prime }}\) in (11) can vary; specific formulae can be found in [6, 7]. Equations (11) are written for the inner grid nodes; for the near-boundary nodes, i.e. those with indices \(i=1\), \(i=N_{x}\) or \(j=1\), \(j=N_{y}\), the solution values known from the boundary conditions are substituted into the corresponding equations and moved to their right-hand sides, so that the corresponding coefficients \(a_{l,l^{\prime }}\) in (11) equal zero (the so-called “constraining” procedure).
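A compact way to assemble the five-point scheme (11) in compressed sparse row form, with the “constraining” of the near-boundary couplings, is to generate the stencil entries row by row. The sketch below assumes the standard central-difference coefficients for (9); the specific formulae of [6, 7] may differ.

```python
import numpy as np
from scipy.sparse import csr_matrix

def assemble_five_point(Nx, Ny, hx, hy, p=0.0, q=0.0):
    """Five-point convection-diffusion matrix on the inner nodes, natural ordering (12).
    Couplings to boundary nodes are simply not generated ("constraining"); the known
    boundary values are assumed to be moved to the right-hand side separately."""
    N = Nx * Ny
    rows, cols, vals = [], [], []
    cx, cy = 1.0 / hx**2, 1.0 / hy**2
    for j in range(1, Ny + 1):
        for i in range(1, Nx + 1):
            l = (i - 1) + (j - 1) * Nx            # 0-based variant of l(i, j) from (12)
            rows.append(l); cols.append(l); vals.append(2 * cx + 2 * cy)
            if i > 1:  rows.append(l); cols.append(l - 1);  vals.append(-cx - p / (2 * hx))
            if i < Nx: rows.append(l); cols.append(l + 1);  vals.append(-cx + p / (2 * hx))
            if j > 1:  rows.append(l); cols.append(l - Nx); vals.append(-cy - q / (2 * hy))
            if j < Ny: rows.append(l); cols.append(l + Nx); vals.append(-cy + q / (2 * hy))
    return csr_matrix((vals, (rows, cols)), shape=(N, N))
```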

We can think of an isomorphism between the vector entries in (1)–(5) and the grid nodes: \(u_{l}\) is the value of the grid function u at the l-th node of the grid \(\varOmega \), which is the set of all nodes in the computational domain. The subdomains \(\varOmega _{s}\) in (2) can be redefined as grid subdomains, and for the model problem (9) we use a simple decomposition of \(\varOmega \) into a union of identical non-intersecting rectangular subdomains:

$$ \varOmega =\bigcup \limits ^{P}_{s=1}\varOmega _{s}, P=P_{x}P_{y}, $$

each containing an equal number of the grid nodes

$$ M=m_{x}m_{y}, N_{x}=P_{x}m_{x}, N_{y}=P_{y}m_{y}, N=PM. $$

One can see that the subdomains form a two-dimensional macrogrid, where each macrovertex can be numbered by a pair of indices \(p, q\) (similarly to the grid node indices \(i, j\)), and the “continuous” number of a subdomain is defined as

$$\begin{aligned} \begin{array}{l}\displaystyle s=s(p,q)\equiv p+(q-1)P_{x}=1,...,P, \\ p=1,...,P_{x}; q=1,...,P_{y}. \end{array} \end{aligned}$$
(13)

We now turn from the continuous numbering of nodes to their subdomain-by-subdomain ordering: first we number all the nodes in \(\varOmega _{1}\), then in \(\varOmega _{2}\), etc. The components of the vectors \(u, f\) are ordered correspondingly, so that SLAE (11) takes the block-matrix form (5), where \(\bar{u}_{s}\in \mathcal {R}^{N_{s}}\) means the subvector of u whose components correspond to the nodes of the grid subdomain \(\varOmega _{s}\), and \(Q_{s}\) means the set of the numbers of the grid subdomains adjacent to \(\varOmega _{s}\). Hereinafter we assume that the local node ordering in every subdomain is the natural one: local pairs of indices \(i^{\prime }=1,...,m_{x}; \; j^{\prime }=1,...,m_{y}\) are introduced, and a continuous local number is determined by the formula \(l^{\prime }=i^{\prime }+(j^{\prime }-1)m_{x}\), similar to (12).
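The correspondence between the grid indices, the subdomain number (13) and the local number \(l^{\prime }\) can be computed directly; a small sketch with a hypothetical helper, using the 1-based indices of the text:

```python
def node_to_subdomain(i, j, mx, my, Px):
    """Grid node (i, j), 1-based -> subdomain number s from (13) and local number l'."""
    p = (i - 1) // mx + 1                  # macro-column index
    q = (j - 1) // my + 1                  # macro-row index
    s = p + (q - 1) * Px                   # s(p, q) as in (13)
    i_loc = (i - 1) % mx + 1               # local indices inside Omega_s
    j_loc = (j - 1) % my + 1
    l_loc = i_loc + (j_loc - 1) * mx       # local natural number, analogous to (12)
    return s, l_loc
```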

The rate of convergence of the iterative process (8) depends on the number of subdomains, or more precisely, on the diameter of the graph representing the macrogrid formed by the decomposition. This can be explained by the fact that at a single iteration the solution perturbation in one subdomain is transmitted only to the neighbouring, or adjacent, subdomains. To speed up the iterative process, it is natural to use not only the nearest but also the remote subdomain couplings at every step. For this purpose, different approaches are used in decomposition algorithms: deflation, coarse grid correction, aggregation, etc., which are to some extent close to the multigrid principle, as well as low-rank approximations of matrices; see the numerous publications cited at a special site [8].

We will consider the following approach based on an interpolation principle. Let \(\varOmega _{c}\) be a coarse grid with \(N_{c}\ll N\) nodes in the computational domain \(\varOmega \); the nodes of the original grid and of the coarse grid need not coincide.

Let us denote by \(\varphi _{1},...,\varphi _{N_{c}}\) a set of basis interpolating polynomials of order \(M_{c}\) on the grid \(\varOmega _{c}\); they are supposed to have finite support and, without loss of generality, to form a partition of unity, i.e.

$$ \sum \limits ^{N_{c}}_{k=1}\varphi _{k}(x,y)=1. $$

Then a sought for solution vector of SLAE (1) can be presented in the form of an expansion in terms of the given basis:

$$\begin{aligned} \begin{array}{l}\displaystyle u=\{u_{i,j}\approx u^{c}_{i,j}=\sum \limits ^{N_{c}}_{k=1}c_{k}\varphi _{k}(x_{i},y_{j})\}=\varPhi \hat{u}+\psi , \end{array} \end{aligned}$$
(14)

where \(\hat{u}=\{c_{k}\}\in \mathcal {R}^{N_{c}}\) is the vector of the coefficients of the expansion in terms of the basis functions, \(\psi \) is the approximation error, and \(\varPhi =[\varphi _{1}...\varphi _{N_{c}}]\in \mathcal {R}^{N,N_{c}}\) is a rectangular matrix whose k-th column consists of the values of the basis function \(\varphi _{k}(x_{i},y_{j})\) at the N nodes of the original grid \(\varOmega \) (most of the entries of \(\varPhi \) equal zero by virtue of the finiteness of the basis). The columns, i.e. the functions \(\varphi _{k}\), can be assumed orthonormal, but this is not necessary. If at the k-th node \(P_{k}\) of the coarse grid \(\varOmega _{c}\) only one basis function is nonzero \((\varphi _{k}(P_{k^{\prime }})=\delta _{k,k^{\prime }})\), then \(\hat{u}_{k}=c_{k}\) is the exact value of the sought solution at the node \(P_{k}\). Substituting (14) into the original SLAE, one obtains the system
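For the bilinear case, the matrix \(\varPhi \) can be assembled by evaluating, for every fine grid node, the at most four coarse basis functions whose supports contain it. A minimal sketch for a uniform coarse grid over the unit square; the uniform node placement and the function names are assumptions of this illustration.

```python
import numpy as np
from scipy.sparse import lil_matrix

def build_coarse_basis(xs, ys, ncx, ncy):
    """Phi[l, k] = phi_k(x_l, y_l) for bilinear hat functions on a uniform
    ncx-by-ncy coarse grid over [0, 1]^2; each row sums to one (partition of unity)."""
    Hx, Hy = 1.0 / (ncx - 1), 1.0 / (ncy - 1)
    N, Nc = len(xs), ncx * ncy
    Phi = lil_matrix((N, Nc))
    for l, (x, y) in enumerate(zip(xs, ys)):
        kx = min(int(x / Hx), ncx - 2)       # coarse cell containing the fine node
        ky = min(int(y / Hy), ncy - 2)
        tx, ty = x / Hx - kx, y / Hy - ky    # local coordinates within the cell
        for dx, wx in ((0, 1 - tx), (1, tx)):
            for dy, wy in ((0, 1 - ty), (1, ty)):
                Phi[l, (kx + dx) + (ky + dy) * ncx] = wx * wy
    return Phi.tocsr()
```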

$$\begin{aligned} \begin{array}{l}\displaystyle A\varPhi \hat{u}=f-A\psi , \end{array} \end{aligned}$$
(15)

and multiplying it by \(\varPhi ^{T}\) yields

$$\begin{aligned} \begin{array}{l}\displaystyle \hat{A}\hat{u}\equiv \varPhi ^{T}A\varPhi \hat{u}=\varPhi ^{T}f-\varPhi ^{T}A\psi \equiv \hat{f}\in \mathcal {R}^{N_{c}}. \end{array} \end{aligned}$$
(16)

Assuming further that the error \(\psi \) in (14) is sufficiently small and omitting it, one can obtain a system for an approximate coarse grid solution \(\check{u}\):

$$\begin{aligned} \begin{array}{l}\displaystyle \hat{A}\check{u}=\varPhi ^{T}f\equiv \check{f}. \end{array} \end{aligned}$$
(17)

If the matrix A is non-singular and \(\varPhi \) is of full column rank \(N_{c}\ll N\) (we assume these facts to hold in what follows), then from (17) we have

$$ u\approx \tilde{u}=\varPhi \check{u}=\varPhi \hat{A}^{-1}\check{f}=B^{-1}_{c}f, \; B^{-1}_{c}=\varPhi (\varPhi ^{T}A\varPhi )^{-1}\varPhi ^{T}. $$
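Since \(\hat{A}=\varPhi ^{T}A\varPhi \) is only of order \(N_{c}\), the coarse operator \(B^{-1}_{c}\) is cheap to apply once \(\hat{A}\) has been factorized. A minimal sketch (the function name and the SciPy factorization are assumptions):

```python
from scipy.sparse.linalg import splu

def make_coarse_correction(A, Phi):
    """Return a callable applying B_c^{-1} = Phi (Phi^T A Phi)^{-1} Phi^T."""
    A_hat = (Phi.T @ A @ Phi).tocsc()       # small N_c x N_c Galerkin matrix hat{A}
    lu = splu(A_hat)                        # factorize hat{A} once
    def apply(f):
        return Phi @ lu.solve(Phi.T @ f)    # restrict, solve coarsely, prolong back
    return apply
```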

For the error of the approximate solution we have

$$\begin{aligned} \begin{array}{l}\displaystyle u-\tilde{u}=(A^{-1}-B^{-1}_{c})f. \end{array} \end{aligned}$$
(18)

The error of the approximate solution can also be presented via the error of the approximation \(\psi \). Subtracting Eqs. (16) and (17) term by term we have

$$ \hat{A}(\hat{u}-\check{u})=-\varPhi ^{T}A\psi $$

which yields the required relation:

$$ u-\tilde{u}=\varPhi \hat{u}+\psi -\varPhi \check{u}=\psi -B^{-1}_{c}A\psi . $$

The matrix \(B^{-1}_{c}\) introduced above can be regarded as a low rank approximation of the matrix \(A^{-1}\) and used as a preconditioner to build an iterative process. In particular, for an arbitrary vector \(u^{-1}\) we can choose an initial guess as

$$\begin{aligned} \begin{array}{l}\displaystyle u^{0}=u^{-1}+B^{-1}_{c}r^{-1}, r^{-1}=f-Au^{-1}. \end{array} \end{aligned}$$
(19)

In doing so, the corresponding initial residual \(r^{0}=f-Au^{0}\) will be orthogonal to a coarse grid subspace

$$\begin{aligned} \begin{array}{l}\displaystyle \hat{\varPhi }= \text{ span } \{\varphi _{1},...,\varphi _{N_{c}}\} \end{array} \end{aligned}$$
(20)

in the sense of fulfilling the condition

$$\begin{aligned} \begin{array}{l}\displaystyle \varPhi ^{T}r^{0}=\varPhi ^{T}(r^{-1}-A\varPhi \hat{A}^{-1}\varPhi ^{T}r^{-1})=0. \end{array} \end{aligned}$$
(21)

The relations given in [10] are the basis for the conjugate gradient method with deflation, wherein an initial direction vector is chosen by the formula

$$\begin{aligned} \begin{array}{l}\displaystyle p^{0}=(I-B^{-1}_{c}A)r^{0}, \end{array} \end{aligned}$$
(22)

which ensures that the following orthogonality condition holds:

$$\begin{aligned} \begin{array}{l}\displaystyle \varPhi ^{T}Ap^{0}=0. \end{array} \end{aligned}$$
(23)

Further iterations are implemented using the following relations:

$$\begin{aligned} \begin{array}{l}\displaystyle u^{n+1}=u^{n}+\alpha _{n}p^{n}, r^{n+1}=r^{n}-\alpha _{n}Ap^{n}, \\ p^{n+1}=r^{n+1}+\beta _{n}p^{n}-B^{-1}_{c}Ar^{n+1}, \\ \alpha _{n}=(r^{n},r^{n})/(p^{n},Ap^{n}), \beta _{n}=(r^{n+1},r^{n+1})/(r^{n},r^{n}). \end{array} \end{aligned}$$
(24)

In this method, which we will refer to as DCG, the following relations hold at every step:

$$\begin{aligned} \begin{array}{l}\displaystyle \varPhi ^{T}r^{n+1}=0, \varPhi ^{T}Ap^{n+1}=0. \end{array} \end{aligned}$$
(25)
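A compact sketch of the DCG iteration assembled from (19), (22) and (24), written for a symmetric positive definite A; `apply_Bc` is the coarse correction built as above, the choice \(u^{-1}=0\) and the stopping test are assumptions of this sketch.

```python
import numpy as np

def dcg(A, f, apply_Bc, tol=1e-8, maxit=1000):
    """Deflated conjugate gradients: coarse correction of the initial guess (19)
    and of the direction vectors (22), (24)."""
    u = apply_Bc(f)                         # u^0 = B_c^{-1} f  (taking u^{-1} = 0 in (19))
    r = f - A @ u                           # Phi^T r^0 = 0 by construction, cf. (21)
    p = r - apply_Bc(A @ r)                 # p^0 = (I - B_c^{-1} A) r^0, cf. (22)
    rr = r @ r
    for _ in range(maxit):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        u += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) <= tol * np.linalg.norm(f):
            break
        beta = rr_new / rr
        p = r + beta * p - apply_Bc(A @ r)  # deflated direction update, cf. (24)
        rr = rr_new
    return u
```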

If we now turn back to the additive Schwarz method (8), we can try to accelerate it by the coarse grid preconditioner \(B^{-1}_{c}\) (in addition to the preconditioner \(B^{-1}_{ras}\)). We will consider this point in a more general formulation, assuming that the matrix A is non-symmetric and that there may be several, not only two, preconditioning matrices. Moreover, the preconditioners can change from iteration to iteration, which corresponds to the so-called dynamic, or flexible, preconditioning.

The SLAE with the non-symmetric matrix A is solved by the well-known BiCGStab algorithm [1].
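For a quick experiment, the two corrections can be passed to a library BiCGStab through a linear operator; the purely additive combination used below is just one simple option and is an assumption of this sketch, not necessarily the combination used in the KRYLOV library.

```python
from scipy.sparse.linalg import LinearOperator, bicgstab

def solve_with_two_level_prec(A, f, apply_ras, apply_Bc):
    """BiCGStab with an additive two-level preconditioner
    M^{-1} r = B_ras^{-1} r + B_c^{-1} r (assumed combination, for illustration)."""
    N = A.shape[0]
    M = LinearOperator((N, N), matvec=lambda r: apply_ras(r) + apply_Bc(r))
    u, info = bicgstab(A, f, M=M)    # info == 0 means the default tolerance was reached
    return u, info
```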

3 Parallel Technologies of DDM

The objectives of our research consist in the verification, testing, and comparative analysis of the efficiency of different algorithms and computational technologies for solving large sparse SLAEs, aimed at their optimization and inclusion into the KRYLOV library [9] of parallel algebraic solvers. The main requirements for such software are high and scalable performance and the absence of formal restrictions on the order of the SLAEs and on the number of processors and computational cores used. According to [3], strong and weak scalability can be distinguished. The first one describes a decrease in the execution time of one big problem with an increase of the number of computing devices, while the second one stands for an approximate preservation of the solution time when the dimension (the number of degrees of freedom) of the problem and the number of processors and/or cores increase together.

The algorithms were coded taking into account the architecture of the SSCC SB RAS cluster [11] (where the KRYLOV library is available) but without GPGPU usage, as the effective utilization of GPUs in the considered domain decomposition methods has its own technological and computational complexity and requires a special study.

Computations are carried out in the following natural way: if the computational domain is divided into P subdomains, then the solution is performed on \(P+1\) MPI processes (one is the root process and the other ones correspond to their own subdomains). During the program execution, the root MPI process is used to accumulate the partial dot products from the subdomains, thereby also synchronizing the computational work in the subdomains, and upon completion it gathers the whole solution vector.
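The global dot products needed by the Krylov iterations are assembled from the per-subdomain partial sums in a single collective operation. A sketch with mpi4py (an allreduce is used here for brevity, whereas the library accumulates the partial sums on the root process):

```python
import numpy as np
from mpi4py import MPI

def global_dot(x_local: np.ndarray, y_local: np.ndarray, comm=MPI.COMM_WORLD) -> float:
    """Dot product of vectors distributed over the subdomain processes:
    each process contributes the partial sum over its own non-overlapping nodes."""
    partial = float(np.dot(x_local, y_local))
    return comm.allreduce(partial, op=MPI.SUM)   # every process obtains the global value
```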

A scalable parallelization of the algorithms is provided by the synchronization of the calculations in the subdomains and by a minimization of the time losses during interprocessor communications. The auxiliary algebraic subsystems in the subdomains are solved simultaneously on multicore CPUs using multithreaded OpenMP calculations. The reduced system (17) is formed and solved in all the processes.

As the algorithms from the KRYLOV library are designed to solve large sparse SLAEs arising from the approximation of multidimensional boundary value problems on unstructured grids, the well-known compressed sparse row (CSR) format is used to store the non-zero matrix entries. The global matrix A is formed in the root MPI process (in the simplest implementation) at the preliminary stage, and then the block rows \(\bar{A}_{s}=\{A_{s,s^{\prime }}, s^{\prime }\in Q_{s}\}\) from (5) are stored in a distributed manner for the s-th extended subdomain (i.e., on the corresponding MPI processes). If the original matrix is very large, it can be stored in a row-blocked form from the start, and the block rows can then be distributed among the computational nodes (subdomains); thus keeping the global matrix on the root MPI process is not a bottleneck for the problems under consideration.

An important condition for high performance consists in overlapping the arithmetic calculations with the data communications between the subdomains by using non-blocking MPI send-receive operations. Moreover, the volume of the data transfer is very small, as only the short vectors corresponding to the grid points on the mutual boundaries between the subdomains need to be exchanged.
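A minimal mpi4py sketch of overlapping the interface exchange with the interior computations via non-blocking operations; the neighbour lists, buffer layout and the `local_work` callback are assumptions of this illustration.

```python
import numpy as np
from mpi4py import MPI

def exchange_and_compute(comm, neighbours, send_bufs, recv_bufs, local_work):
    """Post non-blocking halo exchanges with the adjacent subdomains, perform the
    interior computations while the short interface vectors are in flight,
    then wait for completion before using the received data."""
    reqs = []
    for nb in neighbours:
        reqs.append(comm.Isend(send_bufs[nb], dest=nb, tag=0))
        reqs.append(comm.Irecv(recv_bufs[nb], source=nb, tag=0))
    local_work()                      # arithmetic on interior nodes overlaps the transfer
    MPI.Request.Waitall(reqs)         # boundary data is now available in recv_bufs
```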

Let us note that for the examined grid boundary value problems, a two-dimensional balanced domain decomposition into subdomains is considered: for an approximately equal number of nodes \(N_S \approx N/P\) in every subdomain, the diameter d of the macrogrid composed of subdomains is approximately equal to \(\sqrt{P}\). As the number of iterations of the additive Schwarz method, even with the usage of preconditioned Krylov methods, is proportional to \(d^{\gamma }, \gamma > 0\), this yields a significant advantage over a one-dimensional decomposition, for which \(d \approx P\).

The solution of the isolated SLAEs in \(\varOmega _s\) is obtained by a direct or an iterative method requiring \((N/P)^{\gamma _1}, \gamma _1 > 0,\) operations at every step of the two-level process. As it is necessary to exchange only the data corresponding to the peripheral nodes of the adjacent subdomains, the volume of such information is much smaller, proportional to \((N/P)^{\gamma _1/2}\) (for two-dimensional BVPs), which allows one to carry out arithmetic and communication operations simultaneously.

A high performance of the code based on the presented approach is ensured by an active usage of the standard functions and vector-matrix operations from BLAS and Sparse BLAS included into Intel MKL [12].

4 Results of Numerical Experiments

We present the results of methodical experiments on solving five-point SLAEs for the 2D Dirichlet problem in the unit square computational domain on square grids with \(N=128^{2}\) and \(256^{2}\) nodes. Calculations were carried out with \(P=2^{2}, 4^{2}, 8^{2}\) MPI processes, each of which corresponded to a subdomain of a square macrogrid. Iterations over the subdomains were realised with the help of the BiCGStab algorithm [1] with the stopping criterion \( ||r^{n}||_{2}\le \varepsilon ||f||_{2}, \varepsilon =10^{-8}. \) The auxiliary subdomain subsystems were solved by the direct solver PARDISO from Intel MKL; the most time-consuming part, the LU decomposition of the subdomain matrices, was done only once, before the iterations.
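The “factorize once, solve many times” pattern used with PARDISO can be imitated with SciPy's SuperLU for illustration (an assumption of this sketch; the library itself calls Intel MKL PARDISO):

```python
from scipy.sparse.linalg import splu

class SubdomainSolver:
    """Factorize the local block A_{s,s} once; reuse the factors at every outer iteration."""
    def __init__(self, A_ss):
        self.lu = splu(A_ss.tocsc())     # the expensive step, done before the iterations
    def solve(self, r_s):
        return self.lu.solve(r_s)        # cheap triangular solves inside each iteration
```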

In Table 1, each cell contains the number of iterations over the subdomains and the SLAE solution time (in seconds) on the grids \(128^{2}\) and \(256^{2}\). The upper figure in each cell corresponds to zero convection coefficients, while the bottom figure corresponds to the convection coefficients \(p=q=4\). Domain decompositions were made with equal overlapping parameters in all subdomains: \(\varDelta _{s}=\varDelta =0,1,2,3,4,5\). Interface boundary conditions of the Dirichlet type between the adjacent subdomains were used in all experiments.

Table 1. The numbers of iterations and the solution times (in seconds) on the grids \(128^{2}\) and \(256^{2}\) for different overlapping parameter \(\varDelta \)

The results demonstrate that, as \(\varDelta \) increases up to 5, the number of iterations reduces 3-4 fold, but when the overlap is large, the time of the subdomain solves begins to grow. So, for almost all the grids and numbers of MPI processes (subdomains), the optimal value of \(\varDelta \) is approximately 3-4 in terms of the total execution time. If the convection coefficients \(p, q\) are nonzero, the number of iterations increases by approximately 30-50%. Let us note that the figures for 4 and 16 subdomains were obtained in experiments where each MPI process ran on its own cluster node in exclusive mode, while the data for 64 subdomains were obtained in a series of experiments on cluster nodes given to the tasks in non-exclusive mode, which caused some increase of the execution time. So the last line of the table does not present the “pure” speedup of the algorithm.

In the Tables below, for the sake of brevity, the results are presented for the Poisson equation, i.e. when the convection coefficients in equation (9) are zero. The experiments showed that for moderate values of \(p, q\) (\(|p|+|q|< 50\)) the behavior of the iterative process varied only slightly.

Numerous results for different model and practical problems showed that the behavior of the iterations in the considered algorithms varied only slightly when the initial error was varied. The experiments described above were carried out for the initial guess \(u^{0}=0\) and the exact SLAE solution \(u=1\).

Table 2 shows the effect of applying two deflation methods when the conjugate gradient algorithm is used without any additional preconditioning and without the additive Schwarz method, for three square grids with different numbers of nodes N and for different macrogrids with \(N_c\) macronodes. The macronodes are taken in the vicinity of the subdomain corners, i.e. for \(P = 2^2, 4^2, 8^2\) the numbers of macronodes \(N_c\) are \(3^2, 5^2\) and \(9^2\), respectively. The basis functions \(\varphi _k(x,y)\) were bilinear finite functions. The three right columns give the numbers of iterations for the single orthogonalization of the form (23) (the upper figures in every cell), while the bottom figures give the iteration numbers for the orthogonalization (25) performed at every iteration. Comparing these data with the algorithm without any deflation (the column with \(P=0\), i.e. no macrogrid is used), one can see an acceleration of up to three times as P increases. However, it should be taken into account that the implementation of the repeated orthogonalization makes each iteration more expensive, so an additional investigation is required to optimize the algorithms in practice.

Table 2. The deflation influence in the conjugate gradient method without additive Schwarz

The results in Table 3 present the same data, but with the additive Schwarz method and the domain decomposition into P subdomains. The numbers of the coarse grid nodes are taken the same as the numbers of macronodes in Table 2. The basis functions \(\varphi _k(x,y)\), as in the previous series of experiments, were the bilinear finite ones. Every cell of Table 3 contains the number of iterations carried out without deflation (the upper figure in each cell) and the number of iterations with the single orthogonalization of the initial guess (the bottom figure in each cell). In every cell, the first column gives the data for the zero value of the overlapping parameter \(\varDelta \), the second column for \(\varDelta = 1\), and the third column for \(\varDelta = 2\).

Table 3. Aggregation influence in the additive Schwarz method (decomposition with different overlapping parameter \(\varDelta \))

The presented results for the considered grids and macrogrids have approximately the same character as those in Table 2: enlarging the deflation space decreases the number of iterations while increasing the amount of computations at each step. In these experiments, the outer iterations were carried out by the BiCGStab method.

Let us note that the experiments for Table 3 were carried out for the initial guess \(u^{0}=0\) and the exact SLAE solution \(u(x_{i},y_{j})=x^{2}_{i}-y^{2}_{j}\). Naturally, the efficiency of the considered “interpolation” deflation depends on the behaviour of the solution sought. For example, if it is \(u(x_{i},y_{j})=x_{i}-y_{j}\), then the usage of the bilinear basis functions \(\varphi _{k}(x,y)\) for \(N_{c}\ge 4\) yields convergence in one iteration, and this fact was confirmed in the experiments.

5 Conclusion

We have studied experimentally the efficiency and the performance of several advanced approaches to domain decomposition methods. The presented results demonstrate a considerable increase of the convergence rate of the iterative process when appropriate overlapping parameters and a coarse grid correction are used to accelerate the additive Schwarz algorithm. It should be mentioned that the augmented versions of DDM have been implemented without additional communication losses. Obviously, further research should be carried out to obtain practical recommendations for optimizing the performance when a combination of various parametrized approaches is used simultaneously. The efficient code optimization for multi-GPGPU and multi-core implementations is a challenging technology and remains an open question. For example, it is interesting to analyse a two-level iterative FGMRES procedure with a dynamic stopping criterion in the subdomains and various basis functions in the low-rank matrix approximations.