
1 Introduction

Tridiagonal Toeplitz systems of linear equations play an important role in many theoretical and practical applications. They appear in numerical algorithms for solving boundary value problems for ordinary and partial differential equations [12, 14]. For example, a numerical solution to the heat-diffusion equation of the following form

$$\frac{\partial u}{\partial t} (x,t)= \alpha ^2 \frac{\partial ^2 u}{\partial x^2} (x,t), \text { for } 0<x<l \text { and } 0<t,$$

with boundary conditions \(u(0,t)=u(l,t)=0\) for \( 0<t\), and \( u(x,0)=f(x)\) for \( 0\le x \le l\), can be found using finite difference methods that reduce the problem to solving tridiagonal Toeplitz systems. Such systems also arise in piecewise cubic interpolation and spline algorithms [2, 11, 13]. Moreover, they appear when the 2D Poisson equation is solved by the variable separation method and the 3D Poisson equation by a combination of the alternating direction implicit method and the variable separation method [13]. Banded Toeplitz matrices also appear in signal and image processing [1].
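For instance, discretizing this problem with the implicit (backward Euler) finite difference scheme on a uniform grid \(x_i=ih\), \(h=l/(N+1)\), with time step \(\tau \) and \(\lambda =\alpha ^2\tau /h^2\), gives at every time level a system with constant diagonals (a standard derivation, sketched here only for illustration):

$$-\lambda u_{i-1}^{k+1}+(1+2\lambda )u_i^{k+1}-\lambda u_{i+1}^{k+1}=u_i^{k},\qquad i=1,\ldots ,N,$$

where \(u_i^k\approx u(x_i,t_k)\) and the boundary values \(u_0^{k+1}=u_{N+1}^{k+1}=0\) are known. Dividing each equation by \(-\lambda \) gives a tridiagonal Toeplitz system of the form (1) considered below, with \(a=1\) and \(d=-(1+2\lambda )/\lambda \).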

There are several methods for solving such systems [3,4,5, 7,8,9, 13,14,15]. The basic idea comes from Rojo [9], who proposed a method for solving symmetric tridiagonal Toeplitz systems using the LU decomposition of a system with almost Toeplitz structure together with the Sherman-Morrison formula [4]. This approach has been modified to obtain new solvers suitable for parallel execution [5, 14]. A simple vectorized but non-parallel algorithm was proposed in [3]. A different approach was taken by McNally et al. [8], who developed a scalable communication-less algorithm. It finds an approximation of the exact solution of a system to within a given tolerance. However, the algorithm does not utilize vectorization explicitly. Terekhov [13] proposed a highly scalable parallel algorithm for solving tridiagonal Toeplitz systems with multiple right-hand sides. It should be noticed that well-known numerical libraries optimized for modern multicore architectures, such as Intel MKL, PLAPACK or NAG, do not provide routines for solving tridiagonal Toeplitz systems; they only provide solvers for more general systems, i.e. non-Toeplitz or dense Toeplitz systems. The case studied here is more specific, but this allows more efficient solvers to be formulated.

In this paper, we present two versions of a new divide-and-conquer parallel vectorized algorithm for finding the exact solution of tridiagonal Toeplitz systems of linear equations. As the starting point, we consider the splitting \(T=LR+P,\) where L, R are bidiagonal and P has only one non-zero entry [3]. Our new approach for solving the resulting bidiagonal Toeplitz systems is based on recently developed algorithms for solving linear recurrence systems [10, 12]. We discuss possible OpenMP implementations and show how to reduce the number of necessary synchronizations in order to improve the performance. Further improvements come from a data layout that makes it possible to exploit cache memory and the SIMD extensions of modern processors. Numerical experiments performed on Intel Xeon CPUs and Intel Xeon Phi show that our new implementations achieve good performance on multicore and manycore architectures. Moreover, they are more energy efficient than a simple sequential algorithm.

2 Parallel Algorithm for Tridiagonal Toeplitz Systems

Let us consider a tridiagonal Toeplitz system of linear equations \(T\mathbf{x} =\mathbf{b} \) of the following form

$$\begin{aligned} \left[ \begin{array}{ccccc} d &{} a &{} &{} &{} \\ 1 &{} d &{} a &{} &{} \\ &{} \ddots &{} \ddots &{} \ddots &{} \\ &{} &{} 1 &{} d &{} a\\ &{} &{} &{} 1 &{} d\\ \end{array} \right] \left[ \begin{array}{c} x_0\\ x_1\\ \vdots \\ \vdots \\ x_{n-1}\\ \end{array} \right] = \left[ \begin{array}{c} b_0\\ b_1\\ \vdots \\ \vdots \\ b_{n-1}\\ \end{array} \right] . \end{aligned}$$
(1)

For the sake of simplicity let us assume that \(n=2^m\), \(m \in \mathbb {N}\), and \(|d|>1+|a|\). Then T is strictly diagonally dominant, hence nonsingular, and pivoting is not needed to ensure numerical stability. To find the solution to (1) we can follow the approach presented in [3] and decompose T as follows

$$\begin{aligned} \left[ \begin{array}{ccccc} d &{} a &{} &{} &{} \\ 1 &{} d &{} a &{} &{} \\ &{} \ddots &{} \ddots &{} \ddots &{} \\ &{} &{} 1 &{} d &{} a\\ &{} &{} &{} 1 &{} d\\ \end{array} \right] = \underbrace{ \left[ \begin{array}{ccccc} r_2 &{} &{} &{} &{} \\ 1 &{} r_2&{} &{} &{} \\ &{} \ddots &{} \ddots &{} &{} \\ &{} &{} 1&{} r_2 &{} \\ &{} &{} &{} 1&{} r_2\\ \end{array} \right] }_{L} \underbrace{ \left[ \begin{array}{ccccc} 1 &{} r_1 &{} &{} &{} \\ &{} 1 &{} r_1 &{} &{} \\ &{} &{} \ddots &{} \ddots &{} \\ &{} &{} &{} 1 &{} r_1 \\ &{} &{} &{} &{} 1\\ \end{array} \right] }_{R} + \left[ \begin{array}{ccccc} r_1 &{} 0 &{} \ldots &{} &{} 0 \\ 0&{} 0 &{} &{} &{} \vdots \\ \vdots &{} &{}\ddots &{} &{} \vdots \\ 0 &{} \ldots &{}\ldots &{} &{} 0\\ \end{array} \right] , \end{aligned}$$
(2)

where \(r_2=(d \pm \sqrt{d^2-4a})/2\) and \(r_1=d-r_2\), so that \(r_1+r_2=d\) and \(r_1 r_2=a\), i.e. LR reproduces T except for its (0, 0) entry. Using this decomposition we can rewrite Eq. (1) as follows

$$\begin{aligned} \left[ \begin{array}{c} x_0\\ x_1\\ \vdots \\ x_{n-1}\\ \end{array} \right] + r_1 x_0 \underbrace{(LR)^{-1} \left[ \begin{array}{c} 1\\ 0\\ \vdots \\ 0\\ \end{array} \right] }_\mathbf{u } = \underbrace{(LR)^{-1} \left[ \begin{array}{c} b_0\\ b_1\\ \vdots \\ b_{n-1}\\ \end{array} \right] }_\mathbf{v } \end{aligned}$$
(3)

or simply \( \mathbf{x} + r_1 x_0 \mathbf{u} =\mathbf{v} . \) Then the solution to (1) can be found using

$$\begin{aligned} \left\{ \begin{array}{ll} x_0 =\frac{v_0}{1+r_1 u_0} \\ x_i = v_i - r_1 x_0 u_i , \quad i=1,\ldots ,n-1. \end{array} \right. \end{aligned}$$
(4)

Let \(\mathbf{e} _k=(\underbrace{0,\ldots ,0}_{k},1,0,\ldots ,0)^T\) denote the k-th unit vector of the appropriate dimension. Note that to apply (4) we only need the vectors \(\mathbf{u} \) and \(\mathbf{v} \), which are the solutions of the systems of linear equations \(LR\mathbf{u} =\mathbf{e} _0\) and \(LR\mathbf{v} =\mathbf{b} \), respectively. The solution to a system of linear equations \(L R \mathbf{y} =\mathbf{f} \) can be found using a simple sequential algorithm based on the following recurrence relations

$$\begin{aligned} \left\{ \begin{array}{ll} z_0 =f_0/r_2\\ z_i = (f_i-z_{i-1})/r_2,\quad i=1,\ldots ,n-1, \end{array} \right. \end{aligned}$$
(5)

and

$$\begin{aligned} \left\{ \begin{array}{ll} y_{n-1} =z_{n-1}\\ y_i = z_i-r_1y_{i+1},\quad i=n-2,\ldots ,0. \end{array} \right. \end{aligned}$$
(6)

Thus, to solve (1) using (4), this simple sequential algorithm based on (5) and (6) has to be applied twice, to \(LR\mathbf{u} =\mathbf{e} _0\) and \(LR\mathbf{v} =\mathbf{b} \), respectively. The sequential algorithm requires \(9n+O(1)\) flops. It should be pointed out that (5) and (6) contain loop-carried data dependencies, thus they cannot be parallelized or vectorized automatically by the compiler.
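A minimal C sketch of this sequential algorithm is given below; the function and variable names are ours (not taken from [3]) and error handling is omitted.

#include <math.h>
#include <stdlib.h>

/* Solve (LR) y = f for the bidiagonal factors of (2): L has r2 on the
 * diagonal and 1 on the subdiagonal, R has 1 on the diagonal and r1 on
 * the superdiagonal.  Implements the recurrences (5) and (6) in place. */
static void solve_LR(int n, double r1, double r2, const double *f, double *y)
{
    y[0] = f[0] / r2;                       /* forward substitution (5) */
    for (int i = 1; i < n; i++)
        y[i] = (f[i] - y[i - 1]) / r2;
    for (int i = n - 2; i >= 0; i--)        /* backward substitution (6) */
        y[i] -= r1 * y[i + 1];
}

/* Solve T x = b via (4): x = v - r1*x0*u, where LRu = e0 and LRv = b. */
void solve_toeplitz_seq(int n, double d, double a, const double *b, double *x)
{
    double r2 = 0.5 * (d + sqrt(d * d - 4.0 * a));   /* one root, see (2) */
    double r1 = d - r2;
    double *u  = malloc((size_t)n * sizeof *u);
    double *e0 = calloc((size_t)n, sizeof *e0);
    e0[0] = 1.0;
    solve_LR(n, r1, r2, e0, u);             /* u = (LR)^{-1} e_0 */
    solve_LR(n, r1, r2, b,  x);             /* x holds v = (LR)^{-1} b */
    double x0 = x[0] / (1.0 + r1 * u[0]);
    x[0] = x0;
    for (int i = 1; i < n; i++)
        x[i] -= r1 * x0 * u[i];
    free(u);
    free(e0);
}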

To develop a new parallel algorithm for solving \(L R \mathbf{y} =\mathbf{f} \) that could utilize vector extensions of modern multiprocessors, we apply the divide-and-conquer algorithm for solving first-order linear recurrence systems with constant coefficients [10, 12]. Let us assume that \(n=rs\) and \(r,s>1\). First, we arrange entries of L into blocks to obtain the following block matrix

$$\begin{aligned} L= \left[ \begin{array}{cccc} L_s &{} &{} &{} \\ B &{} L_s &{} &{} \\ &{} \ddots &{} \ddots &{} \\ &{} &{} B &{} L_s \\ \end{array} \right] , \qquad L_s = \left[ \begin{array}{cccc} r_2 &{} &{} &{} \\ 1 &{} r_2 &{} &{} \\ &{} \ddots &{} \ddots &{} \\ &{} &{} 1 &{} r_2 \\ \end{array} \right] \in \mathbb {R}^{s\times s}, \qquad B=\mathbf{e} _0\mathbf{e} _{s-1}^T \in \mathbb {R}^{s\times s}. \end{aligned}$$
(7)

Let \(\mathbf{z} _i=(z_{is},\ldots ,z_{(i+1)s-1})^T\) and \(\mathbf{f} _i=(f_{is},\ldots ,f_{(i+1)s-1})^T\). Then the lower bidiagonal system of linear equations \(L \mathbf{z} = \mathbf{f} \) can be rewritten in the following recursive form

$$\begin{aligned} \left\{ \begin{array}{ll} L_s \mathbf{z} _0 = \mathbf{f} _0\\ L_s \mathbf{z} _i = \mathbf{f} _i - B \mathbf{z} _{i-1},\quad i=1,\ldots ,r-1. \end{array} \right. \end{aligned}$$
(8)

Since \(B\mathbf{z} _{i-1}=z_{is-1}\mathbf{e} _0\), Eq. (8) reduces to the following form

$$\begin{aligned} \left\{ \begin{array}{ll} \mathbf{z} _0 =L_s ^{-1} \mathbf{f} _0\\ \mathbf{z} _i =L_s ^{-1} \mathbf{f} _i -z_{is-1} L_s ^{-1} \mathbf{e} _0,\quad i=1,\ldots ,r-1. \end{array} \right. \end{aligned}$$
(9)

Note that (9) exhibits a lot of potential parallelism. Once all vectors \(L_s ^{-1} \mathbf{f} _i\), \(i=0,\ldots ,r-1\), have been found, we can apply (9) to find \(\mathbf{z} _i\), \(i=1,\ldots ,r-1\), “one-by-one” using the OpenMP “for simd” construct. Clearly, to find \(\mathbf{z} _i\) we need the last entry of \(\mathbf{z} _{i-1}\), so all threads have to be synchronized before the next vector is computed. Alternatively, we can first find the last entries of all vectors \(\mathbf{z} _i\), and then the first \(s-1\) entries of each vector can be computed in parallel without any thread synchronization.
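A possible C/OpenMP realization of this second, synchronization-free variant is sketched below. The data layout (block \(\mathbf{z} _i\) stored contiguously starting at offset is) and all names are our illustrative assumptions, not the authors' implementation discussed in Sect. 3.

/* Forward stage (9): solve L z = f, with block z_i stored contiguously at
 * z[i*s .. (i+1)*s-1] and g a caller-supplied scratch vector of length s.
 * A sketch only; n is assumed to be divisible by r. */
void forward_stage(int n, int r, double r2, const double *f, double *z, double *g)
{
    int s = n / r;

    g[0] = 1.0 / r2;                    /* g = L_s^{-1} e_0, shared by all blocks */
    for (int j = 1; j < s; j++)
        g[j] = -g[j - 1] / r2;

    /* step 1: z_i := L_s^{-1} f_i independently for every block */
    #pragma omp parallel for
    for (int i = 0; i < r; i++) {
        const double *fi = &f[i * s];
        double *zi = &z[i * s];
        zi[0] = fi[0] / r2;
        for (int j = 1; j < s; j++)
            zi[j] = (fi[j] - zi[j - 1]) / r2;
    }

    /* step 2: correct only the last entry of each block, sequentially */
    for (int i = 1; i < r; i++)
        z[(i + 1) * s - 1] -= z[i * s - 1] * g[s - 1];

    /* step 3: the first s-1 entries of each block depend only on the already
     * corrected last entry of the previous block, so no synchronization is needed */
    #pragma omp parallel for
    for (int i = 1; i < r; i++) {
        double last = z[i * s - 1];     /* last entry of z_{i-1} */
        #pragma omp simd
        for (int j = 0; j < s - 1; j++)
            z[i * s + j] -= last * g[j];
    }
}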

Similarly, in the case of the upper bidiagonal system \(R \mathbf{y} =\mathbf{z} \), under the same assumptions as before, we get

$$\begin{aligned} R= \left[ \begin{array}{cccc} R_s &{} C &{} &{} \\ &{} R_s &{} \ddots &{} \\ &{} &{} \ddots &{} C \\ &{} &{} &{} R_s \\ \end{array} \right] , \qquad R_s = \left[ \begin{array}{cccc} 1 &{} r_1 &{} &{} \\ &{} 1 &{} \ddots &{} \\ &{} &{} \ddots &{} r_1 \\ &{} &{} &{} 1 \\ \end{array} \right] \in \mathbb {R}^{s\times s}, \qquad C=r_1 \mathbf{e} _{s-1}\mathbf{e} _0^T \in \mathbb {R}^{s\times s}. \end{aligned}$$
(10)

The solution of the system (i.e. the vectors \(\mathbf{y} _i\), \(i=0,\ldots ,r-1\)) satisfies

$$\begin{aligned} \left\{ \begin{array}{ll} R_s \mathbf{y} _{r-1} = \mathbf{z} _{r-1}\\ R_s \mathbf{y} _i = \mathbf{z} _i - C \mathbf{y} _{i+1},\quad i=r-2,\ldots ,0. \end{array} \right. \end{aligned}$$
(11)

Since \(C\mathbf{y} _{i+1}=r_1 y_{(i+1)s}\mathbf{e} _{s-1}\), we finally get

$$\begin{aligned} \left\{ \begin{array}{ll} \mathbf{y} _{r-1} =R_s ^{-1} \mathbf{z} _{r-1}\\ \mathbf{y} _i =R_s ^{-1} \mathbf{z} _i -r_1 y_{(i+1)s} R_s ^{-1} \mathbf{e} _{s-1},\quad i=r-2,\ldots ,0. \end{array} \right. \end{aligned}$$
(12)

The situation is similar to the previous one, but to find \(\mathbf{y} _i\), \(i=0,\ldots , r-2\), we need to know the first entries of all vectors \(\mathbf{y} _i\), \(i=1,\ldots , r-1\). Then the remaining entries can be found simultaneously. Such a parallel algorithm requires \(16n-3r-6s+O(1)\) flops.
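The backward stage (12) can be organized in the same way; below is a sketch under the same layout assumptions, with the scratch vector h playing the role of \(R_s ^{-1}\mathbf{e} _{s-1}\).

/* Backward stage (12): solve R y = z in place (y initially holds z),
 * block y_i stored at y[i*s .. (i+1)*s-1], h a scratch vector of length s. */
void backward_stage(int n, int r, double r1, double *y, double *h)
{
    int s = n / r;

    h[s - 1] = 1.0;                     /* h = R_s^{-1} e_{s-1} */
    for (int j = s - 2; j >= 0; j--)
        h[j] = -r1 * h[j + 1];

    /* step 1: y_i := R_s^{-1} z_i independently for every block */
    #pragma omp parallel for
    for (int i = 0; i < r; i++) {
        double *yi = &y[i * s];
        for (int j = s - 2; j >= 0; j--)
            yi[j] -= r1 * yi[j + 1];
    }

    /* step 2: correct only the first entry of each block, sequentially */
    for (int i = r - 2; i >= 0; i--)
        y[i * s] -= r1 * y[(i + 1) * s] * h[0];

    /* step 3: the remaining s-1 entries of each block are independent */
    #pragma omp parallel for
    for (int i = r - 2; i >= 0; i--) {
        double first = y[(i + 1) * s];  /* first entry of y_{i+1} */
        #pragma omp simd
        for (int j = 1; j < s; j++)
            y[i * s + j] -= r1 * first * h[j];
    }
}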

3 Implementation and Results of Experiments

Following (9), (12) and (4) we can formulate two OpenMP versions of our algorithm for solving (1). They are presented in Fig. 1. The overall structure of both versions is the same. We assume that the value of s is a power of two, so that each column is properly aligned in memory (lines 14, 32). Moreover, each column occupies a contiguous block of memory. There are two kinds of loops. The first one (lines 12–19) does not utilize vector extensions, but can be executed in parallel. Note that its inner loop (lines 17–18) retrieves successive elements of columns, so the necessary entries can be found in cache memory. Lines 35–37 contain the other kind of loop: a parallel loop that utilizes vector extensions (via the OpenMP “for simd” construct). The difference between the two versions is relatively small. In the first version, there is an implicit barrier after the inner loop (lines 35–37), because the next iteration of the outer loop (lines 30–39) needs the last entry of the previous column. Alternatively, we can find all last entries in a sequential loop (lines 26–28), and then the inner loop (lines 35–37) can be launched with the “nowait” clause. However, an explicit barrier must be issued after the outer loop (line 39) to ensure that all necessary computations have been completed.

Fig. 1. Two OpenMP versions of the parallel algorithm (abbreviated)
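The essential difference between the two versions is where the threads synchronize. The following skeleton is our reconstruction of the loop structure described above, reusing the column layout and the auxiliary vector g from the sketch in Sect. 2; it is not the actual code of Fig. 1.

/* Version_1: the work-shared inner loop ends with an implicit barrier,
 * because iteration i+1 needs the last entry of column i.
 * Assumes z_i already holds L_s^{-1} f_i for every block i. */
void combine_v1(int r, int s, double *z, const double *g)
{
    #pragma omp parallel
    for (int i = 1; i < r; i++) {
        double last = z[i * s - 1];      /* final after the previous barrier */
        #pragma omp for simd
        for (int j = 0; j < s; j++)
            z[i * s + j] -= last * g[j];
    }
}

/* Version_2: last entries are fixed first in a short sequential loop, so the
 * inner loops can run with nowait; a single barrier after the outer loop is
 * enough (here it coincides with the implicit barrier that ends the parallel
 * region, but it is required when the region continues with further work). */
void combine_v2(int r, int s, double *z, const double *g)
{
    for (int i = 1; i < r; i++)
        z[(i + 1) * s - 1] -= z[i * s - 1] * g[s - 1];

    #pragma omp parallel
    {
        for (int i = 1; i < r; i++) {
            double last = z[i * s - 1];  /* already final */
            #pragma omp for simd nowait
            for (int j = 0; j < s - 1; j++)
                z[i * s + j] -= last * g[j];
        }
        #pragma omp barrier
    }
}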

All experiments have been carried out on a server with two Intel Xeon E5-2670 v3 processors (CPU, 24 cores in total, 2.3 GHz, 256-bit AVX2) and 128 GB RAM, and on a server with an Intel Xeon Phi Coprocessor 7120P (KNC, 61 cores with multithreading, 1.238 GHz, 16 GB RAM, 512-bit vector extensions), an example of the Intel MIC architecture, both running Linux with Intel Parallel Studio ver. 2017. Experiments on the Xeon Phi have been carried out in its native mode. We have tested a sequential implementation (Sequential) based on (5), (6) and (4), optimized automatically by the compiler, and the two versions of our parallel algorithm (Version_1 and Version_2). On both architectures (CPU and MIC), in most cases, the best performance of the parallel algorithm is obtained with one thread per core. However, on MIC, for larger problem sizes, Version_1 achieves better performance with two threads per core.

Fig. 2. Execution time for various n and r on CPU: Version_1 (a), Version_2 (b)

Examples of the results are presented in Figs. 2 and 3 and in Tables 1 and 2. Figures 2 and 3 show the execution time for various values of r and for n ranging from \(2^{26}\) or \(2^{27}\) up to \(2^{29}\) or \(2^{30}\), depending on the architecture.

It should be observed that the right choice of r and \(s=n/r\) is very important for achieving good performance. Increasing r improves the performance only up to a certain point, because a bigger r implies a smaller s. Thus, we need to find a trade-off between the benefits of separating one part of the loop and making this loop too long. Both parallel implementations achieve the best performance when r is a small multiple of the number of available cores.
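One possible heuristic for choosing these parameters along the lines of this observation is sketched below; the factor of four is our illustrative guess, not a value taken from the experiments.

#include <omp.h>

/* Pick r as a power of two close to a small multiple of the thread count,
 * so that s = n/r also stays a power of two (n = 2^m is assumed). */
static void choose_blocking(long n, long *r, long *s)
{
    long target = 4L * omp_get_max_threads();   /* small multiple of the cores */
    long rr = 1;
    while (2 * rr <= target && 2 * rr < n)
        rr *= 2;
    *r = rr;
    *s = n / rr;
}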

Fig. 3. Execution time for various n and r on MIC: Version_1 (a), Version_2 (b)

Table 1. Execution time [s] and speedup for optimal values of r (CPU)
Table 2. Execution time [s] and speedup for optimal values of r (MIC)

Tables 1, 2 show the execution time and speedup obtained for optimal values of r. On CPU, Version_1 achieves better performance for smaller values of n, but on MIC the situation is reversed: Version_1 achieves better performance for bigger values of n. Better speedup (up to 30) and efficiency can be observed on MIC, where the use of vector extensions is crucial for achieving good performance.

4 Energy Efficiency

Figure 4 and Table 3 present exemplary results of our experiments concerning the energy efficiency of our implementations. The data have been collected on the server with two Intel Xeon E5-2670 v3 processors using Intel's Running Average Power Limit (RAPL) interface [6], which makes it possible to measure the power consumption of the CPUs and DRAM. Figure 4 shows how the power consumption changes during the execution of a program that calls Sequential, Version_1 and Version_2, one after another.
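The paper does not describe how the RAPL counters were read; one common possibility on Linux is the powercap sysfs interface, sketched below. The path and the measurement helper are our assumptions, not part of the original setup.

#include <stdio.h>

/* Read the cumulative package-0 energy counter [microjoules] exposed by the
 * Linux powercap/RAPL driver; returns a negative value on failure. */
static double rapl_energy_uj(void)
{
    FILE *fp = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    if (!fp) return -1.0;
    double uj = -1.0;
    if (fscanf(fp, "%lf", &uj) != 1) uj = -1.0;
    fclose(fp);
    return uj;
}

/* Sample before and after a run and take the difference.  A real measurement
 * loop also has to handle counter wrap-around (max_energy_range_uj). */
double measure_energy_j(void (*run)(void))
{
    double e0 = rapl_energy_uj();
    run();
    double e1 = rapl_energy_uj();
    return (e1 - e0) * 1e-6;    /* microjoules -> joules */
}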

Fig. 4. Total power consumption required by all considered implementations

We also report the total power consumption of the CPUs and DRAM for \(n=2^{30}\) and the optimal value of r (in the case of the parallel implementations). We start with the sequential method (18.2 s), then run Version_1 (4.2 s) and Version_2 (3.8 s). The power draw during the execution of Version_1 and Version_2 is clearly much higher, but it lasts only for a short time.

The energy consumption [J] for various problem sizes is presented in Table 3. We can observe that both parallel versions need only about \(50\%\) of the energy required by Sequential.
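These two observations are consistent. Taking the timings quoted above for \(n=2^{30}\) and the roughly \(50\%\) energy ratio, the average power draw of a parallel version can be estimated as

$$\frac{E_{\mathrm{par}}}{E_{\mathrm{seq}}}=\frac{\bar{P}_{\mathrm{par}}\, t_{\mathrm{par}}}{\bar{P}_{\mathrm{seq}}\, t_{\mathrm{seq}}}\approx 0.5 \quad \Longrightarrow \quad \frac{\bar{P}_{\mathrm{par}}}{\bar{P}_{\mathrm{seq}}}\approx 0.5\cdot \frac{18.2}{3.8}\approx 2.4,$$

i.e. the parallel code draws roughly two and a half times more power on average, but for a much shorter time.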

Table 3. Total energy consumption [J] on CPU required by all considered implementations

5 Conclusions and Future Work

We have presented a new vectorized parallel algorithm for solving tridiagonal Toeplitz systems of linear equations. Numerical experiments have shown that it achieves good performance and speedup on multicore and, especially, manycore architectures. Moreover, it is more energy efficient than the simple sequential algorithm optimized automatically by the compiler. We plan to show that our approach can easily be implemented on GPUs using OpenACC.