1 Introduction

High-order numerical methods for PDEs, such as spectral methods, the spectral-element method, and the discontinuous Galerkin method, play a significant role in numerical simulations that require high-order accuracy and high fidelity. However, the computational complexity of these methods is substantially higher than that of lower-order methods. For example, when solving a \(d\)-dimensional system of coupled elliptic equations with non-periodic boundary conditions, the complexity of a polynomial-based, scalable spectral method can be as high as O(\(LN^{d+1}\)) (cf. [4]), where \(L\) is the number of equations, \(d\) is the dimension, and \(N\) is the number of points in each dimension. Therefore, it is extremely important to improve the efficiency of high-order methods such as spectral methods. Nowadays, general-purpose computing on graphics processing units (GPGPU) is considered one of the most popular approaches in parallel computing (cf. [7, 14, 17, 27]), and has been successfully applied to many high-order methods (cf. [2, 5, 8, 12, 15]). The focus of our paper is to establish a unified framework for GPU-accelerated spectral methods for systems of coupled elliptic equations.

We now review existing implementations of high-order methods on GPUs. In [15], the authors developed the nodal discontinuous Galerkin method on the GPU, which is considered the first GPU implementation of a high-order (spectrally accurate in space) method on general domains. For problems with periodic boundary conditions, pseudo-spectral methods with a Fourier basis have been developed (cf. [2, 12]). For problems with non-periodic boundary conditions on a single domain, a GPU-parallelized spectral collocation method was proposed in [5] for single elliptic equations, and Fourier-continuation solvers were developed in [8, 22] for hyperbolic equations. The trend of designing and implementing high-order methods on GPUs has gradually emerged. However, to the best of our knowledge, research on fast, direct, polynomial-based spectral solvers on the GPU (besides [5]) is still limited.

Although the spectral collocation method has been successfully implemented on the GPU (cf. [5]), it remains an open question how to extend another important variant of spectral methods, the spectral-Galerkin method, to systems of coupled equations on the GPU. The first contribution of this paper is a novel version of the spectral-Galerkin method that is suitable for GPU computing. In [4], the authors proposed an efficient spectral method for solving systems of equations with a complexity of O(\(LN^{d+1}\)). We will show that, following the first investigation for a single elliptic equation in [5], the methodology proposed in [4] for systems of equations can be successfully migrated from the CPU to the GPU. One task of this paper is therefore to tailor legacy spectral-Galerkin methods to achieve high speedups on the GPU platform. The main issue is how to fit the diagonalization technique into the current generation of CUDA tools. A crucial step within this framework is a new way of forming the right hand side inner-products: instead of performing the physical-spectral transformation of the legacy spectral-Galerkin method, we use Gauss quadratures to calculate the inner-products directly. This technique eliminates the implementation overhead of a separate transform between the physical and spectral representations, while maintaining the spectral accuracy and the same algorithmic complexity as the legacy algorithm. Next, we accelerate the bulk matrix-matrix multiplications and the decoupled step on the GPU, while maintaining the same order of complexity as the CPU algorithm. Another issue is how to align high-dimensional data so that the memory accesses on the GPU are coalesced in the different stages of the algorithm. Finally, we illustrate how to minimize the memory transfer between the host and the device, which is critical for GPU implementations. Without regular throughput and fully coalesced global memory access, significant performance enhancement on the GPU is impossible.

Based on the new spectral-Galerkin method and the previous spectral collocation method, we are able to propose a unified procedure for GPU-accelerated spectral methods for systems of coupled second-order equations, as the main contribution of this paper. We will briefly review the spectral collocation method and then explain our key idea in the design of the spectral-Galerkin method for a 2-D model equation. The algorithm is very flexible and can easily be extended to the 3-D case. To initialize the matrix diagonalization procedure, the eigenvalues and eigenvectors in different dimensions need to be precomputed on the host, then shipped to the device, since they will be used repeatedly in a typical marching scheme for time-dependent problems. The basic procedure of the proposed framework is illustrated through the flow chart in Fig. 1.

Fig. 1

An algorithm flow chart of the spectral collocation method and the spectral-Galerkin method

We report an order-of-magnitude speedup for both 2-D and 3-D systems, by comparing an eight-core threaded, MKL-linked CPU program with a GPU implementation on the Tesla K20 device. Furthermore, valuable information on the comparison of the two methods is provided. We found that the spectral collocation method is more expensive in terms of the initialization and the memory transfer, while the spectral-Galerkin method provides better accuracy but is almost twice as expensive in terms of core computations on the GPU. Finally, to demonstrate the wide applicability of the solvers, we apply the spectral collocation method to the 2-D FitzHugh-Nagumo equation from computational neuroscience and the spectral-Galerkin method to the 3-D Cahn-Hilliard equation from computational materials science; the computing times for these time-dependent problems are dramatically reduced. Our work indicates that high performance computing and high-order methods are an excellent match for accelerating scientific simulations.

This paper is organized as follows. The design of the spectral collocation method and the spectral-Galerkin method for systems of elliptic equations is described in detail in Sect. 2. These details form the basis for the in-depth discussion of the GPU parallelization in Sect. 3. In Sect. 4, we first present benchmark accuracies and speedups for solving model systems in 2-D and 3-D, on the CPU and the GPU respectively. Then, we illustrate how to apply the accelerated solvers to time-dependent problems, using the 2-D FitzHugh-Nagumo equation and the 3-D Cahn-Hilliard equation as examples. We end the paper with concluding remarks in the final section.

2 Design of Spectral Methods for Systems of Coupled Equations

In this section, we describe details of spectral methods for systems of coupled equations. We first describe the spectral-Galerkin method, then extend the framework to the spectral collocation method. In terms of the presentation, we take an ‘incremental’ approach from \(L=1\) (one equation) to \(L=2\) and to a general value of \(L\), and from 1-D to 3-D.

2.1 Spectral-Galerkin Method

In this subsection, we first give the 1-D preliminaries, followed by the description of the core algorithm for two coupled equations. We then generalize to systems of more than two equations in Sect. 2.1.4 and to the 3-D case in Sect. 2.1.5.

2.1.1 Elements in the 1-D Case

We start from describing the spectral-Galerkin method in the 1-D case. Consider

$$\begin{aligned} \alpha u - u'' = f, \text { in } \Omega =(-1,1), \quad \mathcal {B}u|_{\pm 1}:= (a_{\pm }u+b_{\pm }u_x)|_{\pm 1} =0, \end{aligned}$$
(2.1)

where \(a_\pm \) and \(b_\pm \) are constants such that the above equation is well-posed. The boundary operator \(\mathcal {B}\) can be of the Dirichlet, Neumann, or Robin type.

Let \(\mathbb {P}_N\) be the space of polynomials of degree at most \(N\). Define

$$\begin{aligned} \mathbb {X}_N = \{u\in \mathbb {P}_N : \mathcal {B}u|_{\pm 1} =0 \}. \end{aligned}$$
(2.2)

To illustrate the key idea, we consider the Legendre-Galerkin method, which consists of finding \(u_N \in \mathbb {X}_N\), such that,

$$\begin{aligned} \alpha (u_N,w ) - (u_N'',w ) = (\mathcal {I}_N f,w ), \quad \forall w \in \mathbb {X}_N. \end{aligned}$$
(2.3)

In the above, \((v_1, v_2)\) denotes the \(L^2\)-inner product of \(v_1\) and \(v_2\) over \(\Omega \), and \(\mathcal {I}_N\) is the interpolation operator from \(C(\Omega )\) to \(\mathbb {P}_N\) based on the Legendre-Gauss-Lobatto points.

Let \(L_k\) be the Legendre polynomial of degree \(k\). It is shown (cf. [26]) that there exists a unique pair \((a_k,b_k)\) such that

$$\begin{aligned} \tilde{\phi }_k = L_k+a_k L_{k+1} + b_k L_{k+2} \in \mathbb {X}_N. \end{aligned}$$
(2.4)

Throughout the paper, \(N+1\) denotes the number of quadrature points (indexed from \(0\) to \(N\)), and \(q+1\) denotes the number of basis functions (indexed from \(0\) to \(q\)). According to (2.4), we have \(q=N-2\). Consequently, we denote

$$\begin{aligned} \mathbb {X}_N=\text {span}\{\tilde{\phi }_0,\tilde{\phi }_1,\ldots ,\tilde{\phi }_q\}. \end{aligned}$$
(2.5)

Without loss of generality, we focus on the case of the homogeneous Dirichlet boundary condition, for which we have

$$\begin{aligned} a_k = 0, \quad b_k = -1, \quad 0 \leqslant k \leqslant q. \end{aligned}$$
(2.6)

Thanks to the orthogonality of the Legendre polynomials, it is easy to see that \((\tilde{\phi }_j'',\tilde{\phi }_i)=0\) for \(j<i\). One can easily check by integration by parts that \((\tilde{\phi }_j'',\tilde{\phi }_i)=(\tilde{\phi }_j,\tilde{\phi }_i'')=0\) for \(j>i\). Hence, by setting

$$\begin{aligned} {\phi }_k = d_k \tilde{\phi }_k \quad \text {with }\; d_k = \frac{1}{\sqrt{-(\tilde{\phi }_k'', \tilde{\phi }_k)}}, \quad 0 \leqslant k \leqslant q, \end{aligned}$$
(2.7)

we have \(S_{ij}:=-(\phi _j'',\phi _i)=\delta _{ij}\), i.e., the stiffness matrix, \(S\), is an identity matrix. Obviously, we also have \(M_{ij}:=(\phi _j,\phi _i)=0\) if \(|i-j|>2\), i.e., the mass matrix, \(M\), is symmetric and penta-diagonal. Denote

$$\begin{aligned} u_N&= \sum _{i=0}^q \hat{u}_i \phi _i,\quad \hat{u}=(\hat{u}_0, \hat{u}_1,\ldots ,\hat{u}_q)^T;\nonumber \\ \hat{f}_i&= (\mathcal {I}_N f,\phi _i),\quad \hat{f}=(\hat{f}_0, \hat{f}_1,\ldots ,\hat{f}_q)^T. \end{aligned}$$
(2.8)

Then the linear system (2.3) is written as

$$\begin{aligned} (\alpha M + I) \hat{u} = \hat{f}, \end{aligned}$$
(2.9)

which is a banded, symmetric positive definite system that can be solved efficiently. Readers may refer to [23] for a full account of the advantages of the Legendre-Galerkin method.
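For reference, in this Dirichlet case one can verify directly from the orthogonality relation \((L_m,L_n)=\frac{2}{2n+1}\delta _{mn}\) and integration by parts (we state this without derivation) that

$$\begin{aligned} d_k = \frac{1}{\sqrt{4k+6}}, \quad M_{kk} = \frac{1}{4k+6}\Big (\frac{2}{2k+1}+\frac{2}{2k+5}\Big ), \quad M_{k,k+2}=M_{k+2,k} = \frac{-2/(2k+5)}{\sqrt{(4k+6)(4k+14)}}, \end{aligned}$$

with all other entries of \(M\) vanishing.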

2.1.2 Spectral-Galerkin Method for a Single 2-D Equation

Next, consider the 2-D elliptic equation:

$$\begin{aligned} \alpha u - \Delta u = f, \text { in } \Omega =(-1,1)^2, \quad u|_{\partial \Omega } =0. \end{aligned}$$
(2.10)

The same number of modes, \(N\), is used in each dimension in order to simplify notations. For the homogeneous Dirichlet boundary condition, the basis sets in \(x\) and \(y\) dimensions are

$$\begin{aligned} \begin{aligned}&\mathbb {X}_N = \text {span}\{ \phi _k(x), 0 \leqslant k \leqslant q\}, \quad \phi _k(x) = L_k(x)-L_{k+2}(x),\\&\mathbb {Y}_N = \text {span}\{ \phi _k(y), 0 \leqslant k \leqslant q\}, \quad \phi _k(y) = L_k(y)-L_{k+2}(y). \end{aligned} \end{aligned}$$
(2.11)

Define the approximating space in 2-D as \(\mathbb {W}_N=\mathbb {X}_N \otimes \mathbb {Y}_N\), the tensor product space of two 1-D spaces in (2.11). Then, the Legendre-Galerkin method for (2.10) consists of finding \(u_N\) in \(\mathbb {W}_N\), such that,

$$\begin{aligned} \alpha (u_N,w ) - (\Delta u_N, w ) = (\mathcal {I}_N f, w),\quad \forall w \in \mathbb {W}_N, \end{aligned}$$
(2.12)

where \(\mathcal {I}_N\) is the interpolation operator from \(C(\Omega )\) to \(\mathbb {P}_N \otimes \mathbb {P}_N\).

Let \(\xi _{kj}(x,y)=\phi _k(x)\phi _j(y)\) and write

$$\begin{aligned} u_N = \sum _{k=0}^q \sum _{j=0}^q \hat{u}_{kj} \xi _{kj}(x,y). \end{aligned}$$
(2.13)

Plugging the above into (2.12) and taking \(w=\xi _{kj}\), we obtain the matrix system:

$$\begin{aligned} \alpha M \hat{U}M + ( \hat{U}M + M \hat{U})= \hat{F}, \end{aligned}$$
(2.14)

where

$$\begin{aligned} \hat{U}_{kj} = \hat{u}_{kj}, \quad M_{kj} = (\phi _j, \phi _k), \quad \hat{F}_{kj}=(\mathcal {I}_N f,\xi _{kj}). \end{aligned}$$
(2.15)

We now introduce two crucial ingredients that allow the spectral-Galerkin method to be efficiently parallelized on the GPU. First, the right hand side matrix, \(\hat{F}\), shall be formed directly from the Gauss quadrature, i.e.,

$$\begin{aligned} \hat{F}_{kj} = (\mathcal {I}_N f,\xi _{kj}) \approx \sum _{i=0}^N \sum _{l=0}^N f(x_i,y_l) \xi _{kj}(x_i,y_l) w_i w_l , \end{aligned}$$
(2.16)

where \(\{ x_i,y_l\}\) and \(\{w_i,w_l \}\) can be Gauss quadrature points and weights of any type.

Note that the above approach is slightly different from the original spectral-Galerkin method (cf. [23]), in which the right hand side matrix and the unknown function evaluations are obtained through spectral-physical transforms. In the new algorithm, the direct quadratures have the same computational complexity as the transforms in the original spectral-Galerkin method. The new approach requires more storage, but it is more suitable for GPU computing, because every step is modularized as a full matrix operation. In Sect. 3, we will further discuss implementation details of the proposed spectral-Galerkin method on a GPU.

To continue, we still use Legendre-Gauss-Lobatto points and weights. The numerical solution can be evaluated by direct, point-wise evaluations:

$$\begin{aligned} u_N (x_i,y_l) = \sum _{k=0}^q \sum _{j=0}^q \hat{u}_{kj} \xi _{kj}(x_i,y_l). \end{aligned}$$
(2.17)

Therefore, the processes of forming \(\hat{F}\) and evaluating \(u_N(x_i,y_l)\) can be cast as dense matrix-matrix multiplications:

$$\begin{aligned}&\hat{F}= Q \bar{F}Q^T, \end{aligned}$$
(2.18)
$$\begin{aligned}&\bar{U}= P \hat{U}P^T, \end{aligned}$$
(2.19)

where

$$\begin{aligned} P_{ik}&= \phi _k(x_i), \quad Q_{ki}=\phi _k(x_i) w_i;\nonumber \\ \bar{U}_{il}&= u_N(x_i,y_l),\quad \bar{F}_{il} = f(x_i,y_l). \end{aligned}$$
(2.20)

Note that in this simplified case we have the same \(M\), \(P\), and \(Q\) in both \(x\) and \(y\) dimensions.
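To make the construction of \(P\) and \(Q\) in (2.20) concrete, the following host-side sketch fills both matrices for the Dirichlet basis \(\phi _k = L_k - L_{k+2}\), assuming the Legendre-Gauss-Lobatto points \(x_i\) and weights \(w_i\) are already available; the Legendre values are generated with the three-term recurrence \((k+1)L_{k+1}(x)=(2k+1)xL_k(x)-kL_{k-1}(x)\). The function name and the row-major layout are our own choices, not part of the method itself.

```cpp
#include <vector>

// Fill P of size (N+1) x (q+1), P_{ik} = phi_k(x_i), and
// Q of size (q+1) x (N+1), Q_{ki} = phi_k(x_i) * w_i, for the
// Dirichlet basis phi_k = L_k - L_{k+2}, q = N - 2 (row-major storage).
void build_P_Q(int N, const std::vector<double>& x, const std::vector<double>& w,
               std::vector<double>& P, std::vector<double>& Q)
{
    const int q = N - 2;
    P.assign((N + 1) * (q + 1), 0.0);
    Q.assign((q + 1) * (N + 1), 0.0);

    std::vector<double> L(N + 3);                 // L[k] = L_k(x_i)
    for (int i = 0; i <= N; ++i) {
        L[0] = 1.0;
        L[1] = x[i];
        for (int k = 1; k <= N + 1; ++k)          // three-term recurrence
            L[k + 1] = ((2.0 * k + 1.0) * x[i] * L[k] - k * L[k - 1]) / (k + 1.0);

        for (int k = 0; k <= q; ++k) {
            const double phi = L[k] - L[k + 2];   // phi_k(x_i)
            P[i * (q + 1) + k] = phi;             // P_{ik}
            Q[k * (N + 1) + i] = phi * w[i];      // Q_{ki}
        }
    }
}
```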

The second ingredient of the present algorithm is the matrix diagonalization technique for solving (2.14). Consider the eigenvalue problem:

$$\begin{aligned} ME = E\Lambda , \end{aligned}$$
(2.21)

where \(E\) is orthonormal and \(\Lambda \) is diagonal (this decomposition is guaranteed to be well conditioned thanks to the nice structure of \(M\)). Then we plug into (2.14) the following substitution:

$$\begin{aligned} \hat{U}= E \hat{V}E^T. \end{aligned}$$
(2.22)

Multiplying the resultant equations by \(E^T\) on the left and \(E\) on the right, and using the fact that \(E^T E=I\), one ends up with

$$\begin{aligned} \alpha \Lambda \hat{V}\Lambda + (\Lambda \hat{V}+\hat{V}\Lambda ) = \hat{B}, \end{aligned}$$
(2.23)

where

$$\begin{aligned} \hat{B}= E^T \hat{F}E. \end{aligned}$$
(2.24)

Because \(\Lambda \) is diagonal, (2.23) can be decoupled completely.

The above algorithm involves mainly two kinds of computations: (1) dense matrix-matrix multiplications in (2.18), (2.19), (2.22), and (2.24); (2) point-wise divisions in solving for \(\hat{V}\) in (2.23). The eigenvalue decomposition in (2.21) can be precomputed and stored once and for all. Therefore, the computational complexity is dominated by the matrix-matrix multiplications, which cost O(\(N^3\)) flops.
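For illustration, a minimal CUDA kernel for the point-wise solve in (2.23) could look as follows; each thread handles one entry \((i,j)\), and the array names and launch configuration are our own choices (the production kernel for coupled systems is discussed in Sect. 3.3).

```cuda
// Solve (alpha*lam_i*lam_j + lam_i + lam_j) * Vhat_ij = Bhat_ij entrywise.
// Bhat and Vhat are (q+1) x (q+1) matrices stored contiguously (row-major);
// lam holds the eigenvalues of M. A sketch only; error checking is omitted.
__global__ void decoupled_solve_2d(const double* __restrict__ Bhat,
                                   const double* __restrict__ lam,
                                   double* __restrict__ Vhat,
                                   double alpha, int q1 /* = q+1 */)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < q1 && j < q1) {
        double denom = alpha * lam[i] * lam[j] + lam[i] + lam[j];
        Vhat[i * q1 + j] = Bhat[i * q1 + j] / denom;
    }
}
```

A launch with, e.g., \(16\times 16\) thread blocks and \(\lceil (q+1)/16\rceil \) blocks in each direction suffices for this step.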

At the end of this subsection, we consider the 3-D case, i.e., the same equation as in (2.10) but with \(\Omega =(-1,1)^3\). Following similar ideas as in the 2-D case, we have the following matrix system for the unknown variable:

$$\begin{aligned} {[}\alpha M \otimes M \otimes M + (M \otimes I \otimes I + I\otimes M \otimes I + I \otimes I \otimes M)] \hat{u}= \hat{f}, \end{aligned}$$
(2.25)

where \(\otimes \) denotes the Kronecker product between two matrices, \(\hat{u}\) and \(\hat{f}\) are vectors of dimension \((q+1)^3\), and they can be converted as follows

$$\begin{aligned} \hat{f}&=( Q \otimes Q \otimes Q) \bar{f}, \end{aligned}$$
(2.26)
$$\begin{aligned} \bar{u}&= (P \otimes P \otimes P) \hat{u}. \end{aligned}$$
(2.27)

Plugging in the substitution

$$\begin{aligned} \hat{u}= (E \otimes E \otimes E) \hat{v}, \end{aligned}$$
(2.28)

we end up with the following decoupled series of scalar equations:

$$\begin{aligned} {[}\alpha \lambda _i \lambda _j \lambda _k + (\lambda _i + \lambda _j + \lambda _k)] \hat{v}_{ijk} = \hat{b}_{ijk}, \quad 0 \leqslant i, j, k \leqslant q, \end{aligned}$$
(2.29)

where \(\hat{b}_{ijk}\) is the element of

$$\begin{aligned} \hat{b}= (E^T \otimes E^T \otimes E^T) \hat{f}. \end{aligned}$$
(2.30)

Like the 2-D case, the above algorithm also involves dense matrix-matrix multiplications in (2.26), (2.27), (2.28), and (2.30). Therefore, the computational complexity is O(\(N^4\)) flops.

2.1.3 Spectral-Galerkin Method for Two Coupled Equations

Our core algorithm is described in this subsection through a system of two coupled elliptic equations in a two-dimensional square domain. Let \((x,y) \in \Omega \triangleq (-1,1)^2\). Consider:

$$\begin{aligned} \left\{ \begin{aligned}&\alpha u + \beta v - \gamma \Delta u = f, \text { in } \Omega , \\&\tilde{\alpha }u + \tilde{\beta }v - \tilde{\gamma }\Delta v = g, \text { in } \Omega , \end{aligned}\right. \end{aligned}$$
(2.31)

subject to the homogeneous Dirichlet boundary conditions for \(u\) and \(v\). Applying the same Legendre-Galerkin method as in Sect. 2.1.2 to (2.31), we obtain the matrix system:

$$\begin{aligned} \left\{ \begin{aligned}&\alpha M \hat{U}M + \beta M \hat{V}M + \gamma ( M\hat{U}+ \hat{U}M)= \hat{F}, \\&\tilde{\alpha }M \hat{U}M + \tilde{\beta }M \hat{V}M + \tilde{\gamma }( M\hat{V}+ \hat{V}M)= \hat{G}, \end{aligned} \right. \end{aligned}$$
(2.32)

where

$$\begin{aligned} \hat{F}= Q \bar{F}Q^T, \quad \hat{G}= Q \bar{G}Q^T. \end{aligned}$$
(2.33)

The way that we form \(\hat{F}\) and \(\hat{G}\) is different from that in the original spectral-Galerkin method for systems of coupled equations (cf. [4]).

Now consider the same eigenvalue decomposition in (2.21), and the substitutions:

$$\begin{aligned} \hat{U}= E \hat{W}E^T, \quad \hat{V}= E \hat{Z}E^T. \end{aligned}$$
(2.34)

Multiplying the resultant equations by \(E^{T}\) on the left and \(E\) on the right, one ends up with

$$\begin{aligned} \left\{ \begin{aligned}&\alpha \Lambda \hat{W}\Lambda + \beta \Lambda \hat{Z}\Lambda + \gamma (\Lambda \hat{W}+\hat{W}\Lambda ) = \hat{A},\\&\tilde{\alpha }\Lambda \hat{W}\Lambda + \tilde{\beta }\Lambda \hat{Z}\Lambda + \tilde{\gamma }(\Lambda \hat{Z}+\hat{Z}\Lambda ) = \hat{B}, \end{aligned} \right. \end{aligned}$$
(2.35)

where

$$\begin{aligned} \hat{A}= E^{T} \hat{F}E, \quad \hat{B}= E^{T} \hat{G}E. \end{aligned}$$
(2.36)

Because \(\Lambda \) is diagonal, (2.35) can be decoupled completely. More precisely, one only needs to solve a \(2\times 2\) linear system for each index \((i,j)\), as follows:

$$\begin{aligned} \left\{ \begin{aligned}&\lambda _i\lambda _j (\alpha w_{ij}+ \beta z_{ij})+ \gamma (\lambda _i w_{ij}+\lambda _j w_{ij}) = a_{ij} , \\&\lambda _i\lambda _j (\tilde{\alpha }w_{ij}+ \tilde{\beta }z_{ij})+ \tilde{\gamma }(\lambda _i z_{ij}+\lambda _j z_{ij}) = b_{ij} , \end{aligned} \right. \end{aligned}$$
(2.37)

where \(w_{ij},z_{ij},a_{ij},b_{ij}\) are entries of \(\hat{W},\hat{Z},\hat{A},\hat{B}\), and \(\lambda _i\)’s are diagonal entries of \(\Lambda \). Finally, we obtain the physical solution through

$$\begin{aligned} \bar{U}= P \hat{U}P^T,\quad \bar{V}= P \hat{V}P^T. \end{aligned}$$
(2.38)

We summarize below the algorithm for solving (2.31) on the CPU.

Algorithm 1

  1. Precompute \(\{E,\Lambda ,P,Q\}\) in different dimensions;

  2. Compute \(\{\hat{F},\hat{G}\}\) from (2.33);

  3. Compute \(\{\hat{A},\hat{B}\}\) from (2.36);

  4. Solve the decoupled system (2.37) to obtain \(\{\hat{W},\hat{Z}\}\);

  5. Compute \(\{\hat{U},\hat{V}\}\) from (2.34);

  6. Compute \(\{\bar{U},\bar{V}\}\) in (2.38).

2.1.4 Spectral-Galerkin Method for \(L\) Coupled Equations

It is easy to see that the spectral-Galerkin method can be extended to systems of more than two equations. Denote the number of equations in the system as \(L\). Consider the 2-D case:

$$\begin{aligned} \left\{ \begin{aligned}&A \varvec{u}- (B^x \varvec{u}_{xx} + B^y \varvec{u}_{yy}) = \varvec{f}, \quad \text {in } \Omega , \\&a_{\pm }^x \varvec{u}(\pm 1, y) + b_{\pm }^x \varvec{u}_x(\pm 1, y) = \varvec{0}, \\&a_{\pm }^y \varvec{u}(x,\pm 1) + b_{\pm }^y \varvec{u}_y(x,\pm 1) = \varvec{0}, \end{aligned} \right. \end{aligned}$$
(2.39)

where \(\varvec{u}\) is the unknown vector of length \(L\), \(\{ A,B^x ,B^y \}\) are given constant matrices of size \(L \times L\), \(\{a_{\pm }^x,b_{\pm }^x,a_{\pm }^y,b_{\pm }^y\}\) are given constants, and \( \varvec{f}\) is a given vector function of length \(L\).

Following the same procedure as the last subsection, we end up with an \(L \times L\) linear system for each \((i,j)\):

$$\begin{aligned} (\lambda ^x_i \lambda ^y_j A + \lambda ^x_i B^x + \lambda ^y_j B^y) \hat{v}_{ij} = \hat{g}_{ij}, \end{aligned}$$
(2.40)

where \(\hat{v}\) is the substituted unknown vector, and \(\hat{g}\) comes from the right hand side.

2.1.5 3-D Case

The algorithm can be further generalized to the three dimensional case. Consider the system in (2.31) now with \(\Omega =(-1,1)^3\), and assume that \(u\) and \(v\) satisfy the homogeneous Dirichlet boundary conditions. Similar to the two-dimensional case, we can write the Legendre-Galerkin approximation to the system (2.31) in 3-D as:

$$\begin{aligned} \left\{ \begin{aligned}&(M\otimes M\otimes M)(\alpha \hat{u}+ \beta \hat{v})+ \gamma (M\otimes I\otimes I + I\otimes M \otimes I+ I\otimes I\otimes M) \hat{u}= \hat{f}, \\&(M\otimes M\otimes M)(\tilde{\alpha }\hat{u}+ \tilde{\beta }\hat{v})+ \tilde{\gamma }(M\otimes I\otimes I + I\otimes M \otimes I+ I\otimes I\otimes M) \hat{v}= \hat{g}, \end{aligned} \right. \end{aligned}$$
(2.41)

where \(A_1\otimes A_2\) denotes the Kronecker product of matrices \(A_1\) and \(A_2\). As before, let \((E,\Lambda )\) be the eigen-pair of \(M\), i.e., \(ME=E\Lambda \). Applying the following substitutions:

$$\begin{aligned} \hat{u}= (E \otimes E \otimes E) \hat{w}, \quad \hat{v}= (E \otimes E \otimes E) \hat{z}, \end{aligned}$$
(2.42)

we find that (2.41) reduces to a sequence of \(2\times 2\) linear systems:

$$\begin{aligned} \left\{ \begin{aligned}&\lambda _i \lambda _j \lambda _k ( \alpha w_{ijk}+ \beta z_{ijk}) + \gamma (\lambda _i + \lambda _j +\lambda _k ) w_{ijk} = a_{ijk}, \\&\lambda _i \lambda _j \lambda _k ( \tilde{\alpha }w_{ijk} + \tilde{\beta }z_{ijk} ) + \tilde{\gamma }( \lambda _i+ \lambda _j + \lambda _k) z_{ijk} = b_{ijk}, \end{aligned} \right. \end{aligned}$$
(2.43)

where \(a_{ijk}\) and \(b_{ijk}\) are elements of vectors \( \big ( E^{T} \otimes E^{T} \otimes E^{T} \big ) \hat{f}\) and \( \big ( E^{T} \otimes E^{T} \otimes E^{T} \big ) \hat{g}\), respectively. In other words, we now solve a \(2\times 2\) system for each \((i,j,k)\), for \(0 \leqslant i,j,k \leqslant q\).

2.2 Spectral Collocation Method

In this subsection, we describe how to extend the framework to a GPU-suited spectral collocation method for coupled equations. In the following, a variant of the spectral collocation method is introduced for the 1-D case; it provides the building block for the tensor-product formulation.

We first consider the following elliptic equation in 1-D:

$$\begin{aligned} \left\{ \begin{aligned}&\alpha u - u'' = f, \quad \text {in } \Omega = (-1,1),\\&a_{\pm } u (\pm 1) + b_{\pm } u' (\pm 1) = c_{\pm }, \end{aligned} \right. \end{aligned}$$
(2.44)

where \(\{\alpha , a_{\pm }, b_{\pm }, c_{\pm }\}\) are constants, and \(f\) is a right hand side function. Given a set of Gauss-Lobatto points, \(\{x_i\}_{i=0}^N\), we require the numerical approximation, \(u_N\), to satisfy (2.44) strongly at each \(x_i\). To be specific, the spectral collocation method consists of

$$\begin{aligned} \left\{ \begin{aligned}&\text {finding } u_N \in \mathbb {P}_N \text { such that } \\&\alpha u_N(x_i) - u''_N(x_i) = f(x_i), \quad 1 \leqslant i \leqslant N-1,\\&a_{\pm } u_N (\pm 1) + b_{\pm } u'_N (\pm 1) = c_{\pm }. \end{aligned} \right. \end{aligned}$$
(2.45)

Let \(\{h_i\}_{i=0}^N\) be the Lagrange basis polynomials of degree \(N\) associated with \(\{x_i\}_{i=0}^N\). We introduce the first and second order differentiation matrices:

$$\begin{aligned} D^{(1)} = \big ( d^1_{kj} := h'_j(x_k) \big )_{0 \leqslant k,j \leqslant N}, \quad D^{(2)} = \big ( d^2_{kj} := h''_j(x_k) \big )_{0 \leqslant k,j \leqslant N}. \end{aligned}$$
(2.46)

Then, \(u_N\) can be written in terms of a linear combination of \(\{h_i\}_{i=0}^N\), as follows,

$$\begin{aligned} u_N(x) = \sum _{j=0}^N w_j h_j(x), \quad w_j := u_N(x_j). \end{aligned}$$
(2.47)

Substituting the above into (2.45) leads to

$$\begin{aligned} \left\{ \begin{aligned}&\sum _{j=0}^N (\alpha \delta _{ij} - d^2_{ij})w_j = f(x_i), \quad 1 \leqslant i \leqslant N-1,\\&a_{-} w_N + b_{-} \sum _{j=0}^N d^1_{Nj} w_j = c_{-}, \quad a_{+} w_0 + b_{+} \sum _{j=0}^N d^1_{0j} w_j = c_{+}, \end{aligned} \right. \end{aligned}$$
(2.48)

where \(\delta _{ij}\) is the Kronecker delta.

To solve (2.48), a naive approach is to form a matrix system of size \((N+1) \times (N+1)\), and then solve for the interior values and the boundary values simultaneously by using a direct matrix solver. In other words, one needs to invert a full matrix. This appears to be concise and effective for the 1-D problem. However, for an \(n\)-D problem, a large matrix of size \((N+1)^n \times (N+1)^n\) would need to be inverted. Therefore, the computational cost is enormous and the algorithm is not optimal.

In the following, we introduce another approach, which dramatically reduces the arithmetic complexity and is shown later in the paper to be well suited to GPU computing. The key is to incorporate the boundary conditions into the collocation matrix \(S\), so that the system can be written in the tensor-product form.

First, we rewrite equations involving boundary values as a \(2\times 2\) system in \(w_0\) and \(w_N\):

$$\begin{aligned} \left\{ \begin{aligned}&b_{-} d^1_{N0} w_0 + (a_{-} + b_{-} d^1_{NN}) w_N = c_{-} - b_{-} \sum _{j=1}^{N-1} d^1_{Nj} w_j, \\&(a_{+} + b_{+} d^1_{00}) w_0+b_{+} d^1_{0N} w_N = c_{+} - b_{+} \sum _{j=1}^{N-1} d^1_{0j} w_j. \end{aligned} \right. \end{aligned}$$
(2.49)

Solving for \(w_0\) and \(w_N\) symbolically, we have

$$\begin{aligned} w_0 = \tilde{c}_+ - \sum _{j=1}^{N-1} \tilde{\alpha }_{0j} w_j, \quad w_N = \tilde{c}_- - \sum _{j=1}^{N-1} \tilde{\alpha }_{Nj} w_j, \end{aligned}$$
(2.50)

where

$$\begin{aligned} \begin{aligned}&\tilde{a}= b_- d^1_{N0}, \quad \tilde{b}= a_- + b_- d^1_{NN}, \quad \tilde{c}= a_+ + b_+ d^1_{00},\quad \tilde{d}= b_+ d^1_{0N}, \\&\tilde{\alpha }_{0j}=(\tilde{d}b_- d^1_{Nj} - \tilde{b}b_+ d^1_{0j})/\xi , \quad \tilde{\alpha }_{Nj}=(\tilde{a}b_+ d^1_{0j} - \tilde{c}b_- d^1_{Nj})/\xi , \\&\tilde{c}_+ = (\tilde{d}c_- - \tilde{b}c_+) /\xi , \quad \tilde{c}_- = (\tilde{a}c_+ - \tilde{c}c_-) /\xi ,\quad \xi = \tilde{a}\tilde{d}- \tilde{b}\tilde{c}. \\ \end{aligned} \end{aligned}$$
(2.51)

Plug (2.50) into (2.48) to obtain the following linear system:

$$\begin{aligned} (\alpha I- S) \bar{w} = \bar{b}, \end{aligned}$$
(2.52)

where \(\bar{w}_i = w_i\) and \(S\) is an \((N-1)\times (N-1)\) matrix with entries:

$$\begin{aligned} s_{ij} = d^2_{ij} - d^2_{i0} \tilde{\alpha }_{0j} - d^2_{iN} \tilde{\alpha }_{Nj}. \end{aligned}$$
(2.53)

In addition, we need to define a modified right hand side as follows:

$$\begin{aligned} \bar{b}_i = f(x_i) +\tilde{c}_+ d^2_{i0} + \tilde{c}_- d^2_{iN} . \end{aligned}$$
(2.54)

After (2.52) is solved, the boundary values can be calculated according to (2.50).
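As an illustration, the constants in (2.51) and the recovery of the boundary values in (2.50) could be computed with the short host-side sketch below; D1 is the first-order differentiation matrix stored in row-major order, and all names are our own choices rather than routines of our actual code.

```cpp
// Compute the boundary-elimination constants of (2.51) and recover w_0, w_N
// from the interior values via (2.50). D1 is the (N+1) x (N+1) first-order
// differentiation matrix (row-major); wInt holds w_1, ..., w_{N-1}.
void recover_boundary_values(int N, const double* D1,
                             double am, double bm, double cm,   // a_-, b_-, c_-
                             double ap, double bp, double cp,   // a_+, b_+, c_+
                             const double* wInt, double& w0, double& wN)
{
    const int n1 = N + 1;
    const double ta = bm * D1[N * n1 + 0];          // tilde a = b_- d^1_{N0}
    const double tb = am + bm * D1[N * n1 + N];     // tilde b = a_- + b_- d^1_{NN}
    const double tc = ap + bp * D1[0 * n1 + 0];     // tilde c = a_+ + b_+ d^1_{00}
    const double td = bp * D1[0 * n1 + N];          // tilde d = b_+ d^1_{0N}
    const double xi = ta * td - tb * tc;            // determinant of the 2x2 system

    const double cplus  = (td * cm - tb * cp) / xi; // tilde c_+
    const double cminus = (ta * cp - tc * cm) / xi; // tilde c_-

    w0 = cplus;
    wN = cminus;
    for (int j = 1; j <= N - 1; ++j) {              // subtract the interior sums in (2.50)
        const double a0j = (td * bm * D1[N * n1 + j] - tb * bp * D1[0 * n1 + j]) / xi;
        const double aNj = (ta * bp * D1[0 * n1 + j] - tc * bm * D1[N * n1 + j]) / xi;
        w0 -= a0j * wInt[j - 1];
        wN -= aNj * wInt[j - 1];
    }
}
```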

Remark 2.1

In this new approach, the discrete boundary conditions are absorbed into the linear system, so one does not have to modify the linear system afterwards in order to satisfy the boundary conditions. Therefore, it allows one to form the linear system in high-dimensional cases with the tensor-product formulation.

Remark 2.2

We notice that \(S\) depends only on \(a_\pm \) and \(b_\pm \), while \(\tilde{c}_\pm \) also depend on \(c_\pm \). It means that the linear system remains the same even if \(c_{\pm }\) change in time for evolutionary systems, because \(\{a_{\pm }, b_{\pm }\}\) do not change.

Remark 2.3

The second order derivatives of \(\bar{w}\) can be evaluated easily with the following operation:

$$\begin{aligned} \bar{w}''_i = (S \bar{w})_i +\tilde{c}_+ d^2_{i0} +\tilde{c}_- d^2_{iN}. \end{aligned}$$
(2.55)

The same technique works for other orders of derivatives.

It remains to solve the linear system (2.52). First, we consider the following eigenvalue decomposition:

$$\begin{aligned} SE = E \Lambda , \end{aligned}$$
(2.56)

where \(E\) is non-singular and \(\Lambda \) is diagonal. It is well-known that the eigenvalues \(\{\lambda _j\}\) are all real and negative (cf. [11, 24]). This eigenvalue decomposition, which is done through the LAPACK subroutine dgeev, accounts for the majority of the initialization step. Next, we consider the substitution:

$$\begin{aligned} \bar{w} = E \bar{v}. \end{aligned}$$
(2.57)

Plug (2.57) into (2.52), and multiply the result by \(E^{-1}\) from the left, to have

$$\begin{aligned} (\alpha I - \Lambda ) \bar{v} = E^{-1} \bar{b}, \end{aligned}$$
(2.58)

which is a diagonal system. Therefore, the system is decoupled for each component of \(\bar{v}\). In the rest of the paper, we focus on the above diagonalization technique.

Based on the above discussion of the 1-D case and the discussion in Sect. 2.1, we can extend the spectral collocation method to a general elliptic system on a separable domain:

$$\begin{aligned} A_1(x)A_2(y) A_3(z)\varvec{u}\!+\!(B_1 (x) \varvec{u}_{x} \!+\! B_2 (y) \varvec{u}_{y}\!+\! B_3 (z) \varvec{u}_{z})\!-\! (C_1 (x) \varvec{u}_{xx} \!+ \!C_2 (y) \varvec{u}_{yy}\!+\! C_3 (z) \varvec{u}_{zz}) \!=\! \varvec{f}, \end{aligned}$$
(2.59)

where \(\varvec{u}\) is the unknown vector of length \(L\) and all coefficients are variable matrices of size \(L \times L\). The boundary condition can be as general as

$$\begin{aligned} a \varvec{u}+ b \frac{\partial \varvec{u}}{\partial \varvec{n}}= \varvec{c}(x,y,z), \end{aligned}$$
(2.60)

where \(\{a,b\}\) are given constants and \(\varvec{c}\) is a given vector function of length \(L\).

3 Accelerating Spectral Methods on a CUDA GPU

We discuss the design and the implementation of GPU-accelerated spectral methods for systems of coupled equations. In Sect. 3.1, we provide observations on the initialization and the memory throughput. Then, the core calculations of the GPU acceleration – the CUBLAS library and user-level kernel functions – are introduced and incorporated into the spectral methods in Sects. 3.2 and 3.3. In Sect. 3.4, we address the issue of the data structure in the 3-D case and optimize the matrix transposes. In many places, we use the spectral-Galerkin method for illustration.

All our experiments were done in double precision on an Intel Xeon-E5540 CPU (eight-core threaded with the MKL BLAS library) and on an Nvidia Tesla K20 GPU, unless specified otherwise. Computing times were given in seconds throughout the paper.

3.1 Comparison of Two Methods on the Initialization and Memory Throughput

In order to minimize the memory transfer between the host (CPU) and the device (GPU), data that do not change in time, which we call static data, should be precomputed on the host, shipped from the host to the device, and stored on the device once and for all. This procedure is critical to an excellent performance of solving time-dependent PDEs on the GPU when a spectral method is used.

First of all, these static data need to be precomputed and stored on the CPU, for example, through step (1) in Algorithm 1. In Table 1, we list the static data/arrays of significant size in the spectral-Galerkin method. For simplicity, only static data in the \(x\) dimension are listed; for the 2-D or 3-D cases, one multiplies these costs by two or three. To get a sense of the total memory usage, we need to include other static data, such as transformation matrices and differentiation matrices that are used in typical applications, as well as dynamic data related to the solution matrix. It is also preferable to allocate one working matrix to avoid in-place function calls in various places of the implementation. Current GPUs are still limited by storage; for example, an Nvidia Tesla K20 GPU has 6 GB of memory. If we take \(q=2^K\), a simple calculation reveals that we are limited to \(K \leqslant 12\) in 2-D and \(K \leqslant 9\) in 3-D (a double-precision matrix of size \(2^{13}\times 2^{13}\) already occupies 512 MB, and a \(2^{10}\times 2^{10}\times 2^{10}\) array occupies 8 GB, so a handful of such arrays would exceed the available memory).

Table 1 Static data in the spectral-Galerkin method (\(x\)-dimension only)

The data in Table 1 will be initialized on the host. The wall clock time of this procedure is denoted as Init (meaning initialization). Figure 2(a) shows that the initializing cost of the spectral-Galerkin method is much less than that of the spectral collocation method. This is because the eigenvalue decomposition of (2.21), which contributes the major part of Init for the spectral-Galerkin method, is done for a banded symmetric positive definite matrix.

Then, we use the CUDA runtime API, cudaMemcpy, to ship the static data from the host to the device. The wall clock time of this step is denoted as HtD (meaning host to device). From Fig. 2(b), it is clear that the spectral-Galerkin method requires less storage than the spectral collocation method does.

Fig. 2

Wall clock times in seconds of the initialization on the host (Init, Step 1 in Algorithm 1) and the memory copy from the host to the device (HtD) in the 2-D case. The above two operations together with DtH (device to host) are performed only once in the whole process of solving a time-dependent PDE. In each dimension, \(N=2^K\)

In order to complete the algorithm, only the solution matrix needs to be shipped back to the host after it is solved on the device. Hence DtH is much smaller than HtD. The two methods are nearly the same in this step.

Therefore, in terms of the GPU acceleration, the focus of our method is on parallelizing steps (2)–(6) of Algorithm 1. Each of them involves updating dynamic data related to the unknown matrices \(\bar{U}\) and \(\bar{V}\). Nevertheless, shipping the static data to the device is crucial in our algorithm. This is because minimizing the memory transfer between the host and the device is a high priority when optimizing a CUDA application. This point has been laid out at the beginning of this subsection.

3.2 CUBLAS: An Efficient and Convenient Strategy for Matrix Multiplications

Steps (2)–(6) in Algorithm 1, especially (2), (3), (5), and (6), dominate the computational complexity and contribute the majority of the computing time. Thus, it is important to implement them with an efficient parallelizing strategy. The first observation is that they are essentially matrix multiplications and thus can be performed with BLAS operations on the CPU [6, 16]. In Table 2, we list the five core operations. According to (2.33), n2b denotes the calculation of the inner-products on the right hand side, i.e., the map from the nodal representation to the basis representation. b2s means the transformation from the coupled basis representation to the decoupled spectral space through (2.36). Conversely, s2b and b2n represent (2.34) and (2.38), respectively. Step (4), kernel, is the user-defined kernel function in CUDA for solving the decoupled system in (2.37). It is not a plain matrix multiplication, and we will deal with it separately in the next subsection.

Table 2 Core operations in the spectral-Galerkin method, their meanings, and corresponding BLAS operations

We choose the CUBLAS library to perform these numerical linear algebra tasks on the GPU. The library is easy to use and heavily optimized for Nvidia GPUs, so a good performance of our algorithm is expected and the implementation is straightforward. Notice that b2s and s2b in Algorithm 1 are multiplications of square matrices of size \((q+1)\times (q+1)\), while n2b and b2n involve non-square matrices of size \((q+1)\times (N+1)\) and \((N+1)\times (q+1)\). In the implementation, users of CUBLAS need to create one or several handles for the operations in Table 2. This allows CUDA to collect all the CUBLAS operations involved in the program, and the optimal grids are arranged automatically by the library infrastructure.
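As a minimal sketch of how one of the operations in Table 2 maps onto CUBLAS, the b2s step \(\hat{A}= E^{T} \hat{F}E\) in (2.36) can be realized with two cublasDgemm calls. The device pointers d_E, d_Fhat, d_Tmp, and d_Ahat are names of our own choosing, all square matrices of order \(n=q+1\), and column-major storage is assumed, as CUBLAS expects.

```cuda
#include <cublas_v2.h>

// Ahat = E^T * Fhat * E via two DGEMM calls (column-major, all n x n).
// d_Tmp is a preallocated n x n workspace on the device.
void b2s(cublasHandle_t handle, int n,
         const double* d_E, const double* d_Fhat,
         double* d_Tmp, double* d_Ahat)
{
    const double one = 1.0, zero = 0.0;
    // Tmp = E^T * Fhat
    cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, n, n, n,
                &one, d_E, n, d_Fhat, n, &zero, d_Tmp, n);
    // Ahat = Tmp * E
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, d_Tmp, n, d_E, n, &zero, d_Ahat, n);
}
```

The same pattern, with the non-square sizes noted above, covers n2b, s2b, and b2n.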

The spectral methods use a smaller number of degrees of freedom and hence require fewer memory accesses. At the same time, all the matrix operations in our algorithm are dense, leading to high arithmetic intensity. These features make spectral methods benefit especially from a GPU with high core density and massive thread parallelism, and they are the key to the remarkable efficiency shown in Sect. 4.

3.3 Establishing the Kernel

We first look at the parallelizing strategy for step (4) in Algorithm 1. While the computational complexity (not necessarily computing times) of step (4) is negligible compared to other steps, it still needs to be parallelized once the dynamic arrays (\(\bar{U}\) and \(\bar{V}\)) are on the device. Otherwise, the overall speedup can deteriorate due to intermediate serial operations or possible memory transfers. In this subsection, we focus on the parallelizing strategy for step (4), which is one of the kernel functions that needs to be implemented by users.

We can immediately see that solving (2.40) is an excellent scenario for GPU acceleration, because the \(L\times L\) system for each index \((i,j)\) can be mapped to an independent thread of the device. In other words, each thread invokes a device function in our implementation. When \(L\) is as small as 2 or 3, it is possible to achieve better performance with an analytic solution formula for the linear system. For larger and general values of \(L\), we use an LU decomposition with partial (row) pivoting on each thread.
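For \(L=2\), for instance, a kernel along the following lines solves (2.37) by Cramer's rule, one \((i,j)\) pair per thread; the array names, the row-major flattening, and the launch parameters are our own choices for illustration and are not the appendix code.

```cuda
// Solve the 2x2 system (2.37) for each (i,j), one thread per entry.
// Ahat, Bhat, What, Zhat are (q+1) x (q+1) arrays flattened row-major;
// lam holds the eigenvalues of M.
__global__ void solve_2x2_kernel(const double* __restrict__ Ahat,
                                 const double* __restrict__ Bhat,
                                 const double* __restrict__ lam,
                                 double* __restrict__ What,
                                 double* __restrict__ Zhat,
                                 double alpha,  double beta,  double gamma,
                                 double alphat, double betat, double gammat,
                                 int q1 /* = q+1 */)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= q1 || j >= q1) return;

    double s = lam[i] * lam[j];                       // lambda_i * lambda_j
    double t = lam[i] + lam[j];                       // lambda_i + lambda_j
    double m11 = alpha  * s + gamma  * t, m12 = beta * s;
    double m21 = alphat * s,              m22 = betat * s + gammat * t;
    double det = m11 * m22 - m12 * m21;

    double a = Ahat[i * q1 + j], b = Bhat[i * q1 + j];
    What[i * q1 + j] = ( m22 * a - m12 * b) / det;    // Cramer's rule
    Zhat[i * q1 + j] = (-m21 * a + m11 * b) / det;
}
```

A launch with \(16\times 16\) or \(32\times 32\) thread blocks, in line with the observations in Fig. 3, is natural here.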

The question remains how to design optimal grid and block sizes for this particular kernel function. First, we determine the block size; the grid size is then determined as the quotient of the matrix size and the block size. To this end, we carried out an experiment with the CUDA kernel operation using random matrices of size \(2^K \times 2^K\) with \(L=5\). The result is shown in Fig. 3. Figure 3 is consistent with the widely known fact that the optimal block size should be a multiple of the warp or half-warp size of the device (32 or 16 threads). Though Grid A (see Fig. 3) leads to the largest grid size, it does not achieve the highest efficiency, since the latency between different threads becomes significant. In our experiments we also found that the optimal grids for different values of \(L\) are qualitatively the same. Sample CUDA codes in the 2-D case are provided in the appendix (Util_LU_SG2D).

Table 3 Run times in seconds for different types of transposes. transpose-i is for a matrix of size \(2^{11} \times (3\times 2^{11})\) and the tile size is \(16 \times 16\); transpose-ii is for a matrix of size \(3 \times (2^{11} \times 2^{11})\) and the tile size is \(3 \times 128\). The experiments were done on an Nvidia Tesla M2050 GPU
Fig. 3

Computing times in seconds for different grid configurations in kernel with \(L=5\). The block size (number of threads per block) in each of the above three grids is: \(1\times 1\), \(16\times 16\), \(32\times 32\). The grid size is then determined as the quotient of the matrix size and the block size

The essential ideas remain the same for the 3-D case. Auxiliary data are initialized on the host and shipped to the device once and for all. In terms of the grid arrangement, the optimal grid size still matches the warp size, though a 3-D grid is launched.

3.4 Data Structure and Optimizing the Performance of Matrix Transposes

As described in Sect. 2.1.4, the algorithm is designed and implemented for general values of \(L\). This leads us to use a one-dimensional array in the implementation to store all \(L\) components of the unknown solution. In other words, the unknown matrix in the program is re-aligned as a single long vector by going through the \(x\) dimension first, \(y\) second, \(z\) third, and the \(L\) different components last (we call this fourth dimension the \(L\) dimension). In 2-D, the size of the array is \(q_x \times q_y \times L\), which leads to a 3-D data structure. In 3-D, the array is actually 4-D.

We first discuss the 2-D case, in which the unknown array has three dimensions. The issue here is that the memory access is not coalesced in the \(y\) and \(L\) dimensions. First, there is a stride of length \(q_x\) in the \(y\) dimension. Hence, before the matrix-matrix multiplication along the \(y\) dimension, the array needs to be re-aligned so that it can be accessed contiguously. As a specific example, we perform a matrix transpose before the right multiplications by \(E^T\) and \(E\) in (2.34) and (2.36), to ensure that the \(y\)-index goes first, and perform another transpose afterwards to reset the order. We denote these operations as transpose-i. Essentially, we are transposing matrices of size \(q_x \times q_y L\). When \(q_x\) and \(q_y\) are large, we use memory coalescing to accelerate transpose-i. By dividing the matrix into tiles of size \(2^4\times 2^4\), a remarkable performance can be achieved on the GPU (see Table 3). The essential idea of memory coalescing here is to utilize the fast on-chip shared memory of each block, for which the short strides are not a problem. Readers may refer to [21] for more details. Sample CUDA codes for transpose-i are provided in the appendix (Util_transpose_coalesce).
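A minimal sketch of such a tiled transpose (transpose-i) is given below; it follows the standard shared-memory pattern with \(16\times 16\) tiles and a padded column to avoid bank conflicts, and it is not identical to the appendix routine Util_transpose_coalesce.

```cuda
#define TILE 16

// Transpose a rows x cols matrix (row-major): out = in^T.
// Each block moves one TILE x TILE tile through shared memory so that
// both the read from and the write to global memory are coalesced.
__global__ void transpose_coalesced(double* __restrict__ out,
                                    const double* __restrict__ in,
                                    int rows, int cols)
{
    __shared__ double tile[TILE][TILE + 1];   // +1 avoids shared-memory bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;  // column index in 'in'
    int y = blockIdx.y * TILE + threadIdx.y;  // row index in 'in'
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;      // column index in 'out'
    y = blockIdx.x * TILE + threadIdx.y;      // row index in 'out'
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}
```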

Fig. 4

An illustration of two types of matrix transposes with \(q_x=q_y=4\) and \(L=2\)

Secondly, in order to solve those \(L \times L\) systems in (2.40) contiguously, we need to transpose the matrix along the \(L\) dimension as well. Otherwise, there would be a stride of length \(q_x q_y\) in the \(L\) dimension. This leads us to transpose a matrix of size \(q_x q_y \times L\) or \(L \times q_x q_y\). We denote these transposes as transpose-ii. Two types of transposes are visualized in Fig. 4. In order to achieve the best performance, the tile size needs to be adjusted according to \(q_x\), \(q_y\), and \(L\). For example, when \(q_x=q_y=2^{11}\) and \(L=3\), it is natural to choose the tile size to be \(3 \times 2^7\) and the grid size to be \(1 \times 2^{15}\), by noting that the maximum number of blocks in any dimension in a grid should not exceed \(2^{16}-1\).

Notice that all the matrices are allocated dynamically in the parameter \(L\). For transpose-ii with memory coalescing, one needs to go inside the program and modify the tile parameter of the kernel, because the shared-memory tile size is fixed at compile time. Hence, in terms of implementation, the naive transpose is more convenient than the coalesced one, especially for transpose-ii.

In Table 3, run times for the two types of transposes with three different implementations are provided. For transpose-i, both GPU implementations outperform the CPU, and the memory coalescing technique improves the performance significantly. For transpose-ii, the tile size (\(3 \times 2^7\)) does not match the warp size, which is the reason for the worse performance. In many cases, the naive copy is slower than the coalesced version due to the strided memory access in the transpose. However, for transpose-ii we are dealing with long tiles, for which we observe no significant difference between the two methods. Therefore, for ease of implementation, we use the naive-copy transpose for transpose-ii in this paper.

With all above discussions, we are now able to summarize below the spectral-Galerkin method for a system of coupled equations on the GPU.

Algorithm 2

  1. Precomputation on the host:

     (a) Precompute and store \(\{ E, \Lambda , P, Q\}\) in different dimensions;

  2. Ship the above static variables from the host to the device;

  3. Computation on the device:

     (a) n2b: Compute \(\{\hat{F},\hat{G}\}\) in (2.33);

         i. matrix multiplication in \(x\);
         ii. transpose-i on a matrix of size \((q_x+1)\times (n_y+1)L\);
         iii. matrix multiplication in \(y\);
         iv. transpose back;

     (b) b2s: Compute \(\{\hat{A},\hat{B}\}\) in (2.36);

         i. matrix multiplication in \(x\);
         ii. transpose-i on a matrix of size \((q_x+1)\times (q_y+1)L\);
         iii. matrix multiplication in \(y\);
         iv. transpose back;

     (c) kernel: Solve the decoupled linear system for \(\{\hat{W},\hat{Z}\}\) from (2.37);

         i. transpose-ii on a matrix of size \((q_x+1) (q_y+1)\times L\);
         ii. call kernel to invoke a matrix solver on each thread of the grid;
         iii. transpose back;

     (d) s2b: Compute \(\{\hat{U},\hat{V}\}\) in (2.34), in a similar way as b2s;

     (e) b2n: Calculate \(\{\bar{U},\bar{V}\}\) according to (2.38), in a similar way as n2b;

  4. Ship the result \(\{\bar{U},\bar{V}\}\) from the device to the host.

Sample CUDA codes for the essential computations (step 3) are provided in the appendix (SG2D_D_Solve).

In the 3-D case, one needs to perform transpose-i for both \(y\) and \(z\) dimensions, and perform transpose-ii for the \(L\) dimension. The algorithm can be designed in a similar fashion.

4 Numerical Results of Two Methods

We first demonstrate the superior accuracy obtained by the spectral-Galerkin method compared with the spectral collocation method. Then, we report the speedups for model systems in 2-D and 3-D. In the third subsection, we apply the spectral collocation method to solve the 2-D FitzHugh-Nagumo equation. In the fourth subsection, we apply the spectral-Galerkin method to the 3-D Cahn-Hilliard equation.

The details of the CPU, the GPU, and the compilers used were described at the beginning of Sect. 3.

4.1 A Comparison of Numerical Accuracies

We consider the 2-D system (2.39) with homogeneous Neumann boundary conditions, where \(L\) is the number of equations. The exact solution is chosen to be

$$\begin{aligned} u_m = \cos (m \pi x)\cos ((m+1) \pi y), \quad 1 \leqslant m \leqslant L. \end{aligned}$$
(4.1)

The coefficient matrices are chosen as

$$\begin{aligned} A_{ij}\!=\!\left\{ \begin{array}{ll} 2, &{}\quad \text {if } i\!=\!j \\ (i-1)(j-1)+1, &{}\quad \text {if } i \ne j \end{array} \right. \quad B^x_{ij}\!=\!\left\{ \begin{array}{ll} 3, &{}\quad \text {if } i\!=\!j \\ i\!+\!j\!-\!1, &{}\quad \text {if } i \!\ne \! j \end{array} \right. \quad B^y_{ij}\!=\!\left\{ \begin{array}{ll} 4, &{}\quad \text {if } i\!=\!j \\ i\!+\!j, &{}\quad \text {if } i \!\ne \! j \end{array} \right. \end{aligned}$$
(4.2)

and

$$\begin{aligned} a^x_\pm = 0, \quad b^x_\pm = \pm 1, \quad a^y_\pm = 0, \quad b^y_\pm = \pm 1. \end{aligned}$$
(4.3)

Then, \( \varvec{f}\) is calculated analytically according to the above configuration.

Fig. 5

A comparison of the numerical accuracies of the spectral collocation method and the spectral-Galerkin method for (4.2) and (4.1) with \(L=2\)

In Fig. 5, we compare the \(l^\infty \)-errors of the spectral collocation method and the spectral-Galerkin method for the model system above. The numerical results are obtained on the GPU (those on the CPU are identical up to round-off errors). In general, the numerical error of the collocation method is polluted by the large condition number of the second-order differentiation matrices, while the Galerkin method maintains almost the same level of accuracy even as \(K\) increases (cf. [20, 26]).

4.2 Costs and Speedups

After confirming that our algorithm does deliver the expected spectral accuracy, we list the computing times on the CPU and the GPU in Table 4. While the spectral-Galerkin method is more accurate, it is almost twice as expensive as the spectral collocation method. This is consistent with our algorithm design in Sect. 2.

Table 4 Computing time in seconds and speedups of the two methods for (4.2)–(4.1) with \(L=1\)

On the one hand, the theoretical peak performance (double precision) of the K20 GPU is 1.17 teraFLOPS, while the clock speed of the CPU is 2.53 GHz. If we assume that each CPU core runs 4 threads and each thread does 4 double-precision operations with vectorization, the peak performance of the 8-core CPU is about 134.4 gigaFLOPS; in terms of peak performance, the theoretical speedup is therefore about 9. On the other hand, the K20 GPU has 2,496 cores while the Intel Xeon E5540 CPU we are using has 8 cores, so the theoretical speedup in terms of the number of cores may be argued to be 2,496/8 = 312. However, this kind of speedup can only be achieved in ideal situations, for example, pointwise multiplication or addition of two very large vectors. Our spectral methods involve matrix multiplications and matrix transposes, which are not purely pointwise operations. For example, each multiplication of two \(n\times n\) matrices requires \(n^2\) vector inner products, each of which is a summation that cannot be parallelized pointwise. In our experiments, the CUDA threads achieve higher parallel efficiency and speedups as \(K\) increases. In general, the speedup is more than 20 when \(K\geqslant 8\). The global memory on the GPU runs out when \(K>12\). In our framework, taking the GPU spectral-Galerkin method for example, the sustained performance is about 0.27 teraFLOPS (roughly \(4q^3 + q^2\) double-precision operations, with \(q=2^{12}\), per second). In other words, about 23% of the peak performance is achieved.

In Table 4, we also provide the computing times and speedups of the collocation method on an Nvidia Tesla M2050 GPU, whose peak performance (double precision) is 515 gigaFLOPS. It is reasonable to observe that the speedup of our method is reduced by a factor of two or three on this device.

In Table 5, we report the speedups of the spectral-Galerkin method for \(L=6\). In Table 6, the speedups for the 3-D case are listed. We choose exact solutions similar to (4.1). In all these cases, order-of-magnitude speedups are achieved.

Table 5 Computing times and speedups for the spectral-Galerkin method in 2-D for (2.39) with \(L=6\)
Table 6 Computing time in seconds and speedups for the spectral-Galerkin method in 3-D with \(L=2\)

In Tables 4 and 5, the speedup begins to decrease slightly in several cases as \(K\) goes up. The primary reason is that the kernel functions do not achieve the optimal configuration due to the sizes of the matrices involved. For example, when \(L\) is a small prime integer, a transpose-ii has to be performed; in this case, it is impossible to match the half-warp size (16) perfectly with the transpose tile. In addition, the spectral-Galerkin method involves operations on matrices of size \((q+1) \times (N+1)\) in the 2-D case and \((q+1) \times (N+1) \times (N+1)\) or similar in the 3-D case. Notice that \(N=q+2\); hence, it is impossible for both \(q\) and \(N\) to match the warp size. Though these limitations slightly affect the speedup, the GPU spectral methods are still much faster than the CPU versions and maintain the one-order-of-magnitude speedup.

4.3 FitzHugh-Nagumo Equation with the Spectral Collocation Method

The FitzHugh-Nagumo equation can be used to simulate an excitable system, e.g., a neuron [9, 10, 18]. Typically, the computational cost of simulating a neuronal network increases with the number of neurons in the network. Therefore, it is important to reduce the computational cost of simulating each single neuron. In this subsection, we use the GPU-accelerated spectral collocation method to reduce the cost of solving the FitzHugh-Nagumo equation.

Consider the FitzHugh-Nagumo Equation on \(\Omega \subset \mathbb {R}^2\), as follows:

$$\begin{aligned} \left\{ \begin{aligned}&u_t = D_u \Delta u + \frac{1}{\delta } h(u,v),\\&v_t = D_v \Delta v+ g(u,v), \\&\frac{\partial u}{\partial \varvec{n}}\Big |_{\partial \Omega } =\frac{\partial v}{\partial \varvec{n}}\Big |_{\partial \Omega } = 0, \end{aligned} \right. \end{aligned}$$
(4.4)

where \(D_u\) and \(D_v\) are diffusion coefficients for \(u\) and \(v\), respectively, and \(\delta \) is a given parameter. It is a reaction-diffusion system of two coupled nonlinear parabolic equations. We consider the classic cubic FHN local dynamics (cf. [13]) as follows,

$$\begin{aligned} \begin{aligned}&h(u,v) = Au(1-u)(u-a) -v,\\&g(u,v) = u-dv, \end{aligned} \end{aligned}$$
(4.5)

where \(A\), \(a\), and \(d\) are dimensionless parameters. Other nonlinear forms of \(h\) and \(g\) (cf. [1]) can be considered as well.

To illustrate how the GPU spectral method can be applied to time-dependent problems, we use a first-order semi-implicit scheme to discretize (4.4)–(4.5) in time. The methodology certainly applies to higher-order temporal discretizations. Suppose that \(\{u^{n},v^n\}\) are known from the previous time step; we obtain \(\{u^{n+1},v^{n+1}\}\) from the following system:

$$\begin{aligned} \left\{ \begin{aligned}&\frac{u^{n+1}-u^n}{\delta t} = D_u \Delta u^{n+1}+ \frac{1}{\delta } h(u^{n},v^{n+1})\\&\frac{v^{n+1}-v^n}{\delta t} = D_v \Delta v^{n+1}+ g(u^{n+1},v^{n+1}), \\&\frac{\partial u^{n+1}}{\partial \varvec{n}}\Big |_{\partial \Omega } =\frac{\partial v^{n+1}}{\partial \varvec{n}}\Big |_{\partial \Omega } = 0, \end{aligned} \right. \end{aligned}$$
(4.6)

in which all the linear terms are treated implicitly and the nonlinear term explicitly.

It is seen from (4.6) that one needs to solve at each time step a system of two coupled linear elliptic equations. We use the GPU-accelerated Chebyshev collocation method to solve the coupled system.

Let \(\Omega =(-L_x, L_x) \times (-L_y, L_y)\). We set \(L_x=L_y=20\) and use the following initial condition (cf. [19, 28]):

$$\begin{aligned} u(x,y,0) = \left\{ \begin{aligned}&0,&\quad \{x<0\}\cup \{y>5\} \\&u_0,&\quad \text {otherwise}, \end{aligned} \right. \end{aligned}$$
(4.7)

and

$$\begin{aligned} v(x,y,0) = \left\{ \begin{aligned}&0.15,&\quad \{x<1\}\cap \{y<-10\} \\&0,&\quad \text {otherwise}, \end{aligned} \right. \end{aligned}$$
(4.8)

where \(c_1=5\), \(c_2=1\), and

$$\begin{aligned} u_0(x,y) = (1+e^{4|x|-c_1})^{-2}-(1+e^{4|x|-c_2})^{-2}. \end{aligned}$$
(4.9)

Other parameters are chosen as

$$\begin{aligned} D_u=1, \quad D_v = 0, \quad a=0.1, \quad d=0.5, \quad \delta =0.001, \quad A=1. \end{aligned}$$
(4.10)

Let \(T_i\), \(T_c\), and \(T_g\) denote the initialization time, the CPU run time, and the GPU run time, respectively. Then the speedup for a general time-dependent problem with \(m\) time steps is

$$\begin{aligned} \text {speedup} = \frac{T_i + m T_c}{T_i + m T_g}. \end{aligned}$$
(4.11)

For problems that require long time stepping, i.e., for a large \(m\), we have

$$\begin{aligned} \text {speedup} \approx \frac{T_c}{T_g}. \end{aligned}$$
(4.12)

Figure 6 shows an evolution of (4.4) from \(t=0\) to \(t=50\). In this example, \(T_c \approx \) 1 h and \(T_g \approx \) 4 min. In other words, a remarkable speedup of nearly 15 is achieved by using the GPU-accelerated Chebyshev collocation method.

Fig. 6

Numerical simulation of spiral waves in the FitzHugh-Nagumo equation with parameters in (4.7)–(4.10). The number of Chebyshev points in each dimension is \(N=256\), i.e., the equation is discretized on a \(256 \times 256\) Gaussian grid. The time step size is \(\delta t = 10^{-4}\), which implies a long time simulation of \(5\times 10^{5}\) time steps. The computing takes about 1 h on the CPU but only 4 min on the GPU

4.4 Cahn-Hilliard Equation with the Spectral-Galerkin Method

In this subsection we consider the Cahn-Hilliard equation (cf. [3]). Let \(\Omega \) be the cube \(\Omega =(-1,1)^3\). Denote by \(\Gamma \) the boundary of \(\Omega \) and by \(\varvec{n}\) the outward normal to \(\Gamma \). The Cahn-Hilliard equation is a nonlinear time-dependent system of two coupled equations, as follows:

$$\begin{aligned} \left\{ \begin{aligned}&\phi _t = M \Delta \mu ,&\text {in } \Omega , \\&\mu = \frac{f(\phi )}{\epsilon ^2}-\Delta \phi ,&\text {in } \Omega , \\&\frac{\partial \phi }{\partial \varvec{n}}=\frac{\partial \mu }{\partial \varvec{n}}=0,&\text {on } \Gamma , \end{aligned} \right. \end{aligned}$$
(4.13)

where \(\phi (\varvec{x},t)\) is called the phase variable. The nonlinear function \(f(\phi )\) is defined as:

$$\begin{aligned} f(\phi ) = F^\prime (\phi ), \quad F(\phi ) = \frac{1}{4}\big (1-\phi ^2\big )^2. \end{aligned}$$
(4.14)

In (4.13), \(\epsilon \) represents the interfacial width between the different materials, and \(M\) is a constant related to the mobility. Homogeneous Neumann (no-flux) boundary conditions are used.

Fig. 7

The time evolution of the 3-D Cahn-Hilliard equation in (4.13) through the scheme in (4.15) and the GPU-accelerated spectral-Galerkin method. The parameters are chosen as \(\epsilon = 0.02, M=0.001, \delta t = 0.001, K=7\). The computing takes about half a day on the CPU but only about 12 min on the GPU

We approximate (4.13) in time using a first-order energy stabilized scheme (cf. [25]). Assuming that \(\{\phi ^n,\mu ^n\}\) are known, we look for \(\{\mu ^{n+1},\phi ^{n+1}\}\) in the following space-continuous system:

$$\begin{aligned} \left\{ \begin{aligned}&\frac{ \phi ^{n+1}-\phi ^n}{\delta t} = M \Delta \mu ^{n+1},&\text {in } \Omega , \\&\mu ^{n+1} = \frac{s(\phi ^{n+1}-\phi ^n)}{\epsilon ^2}+\frac{f(\phi ^n)}{\epsilon ^2}-\Delta \phi ^{n+1},&\text {in } \Omega , \\&\frac{\partial \phi ^{n+1}}{\partial \varvec{n}}=\frac{\partial \mu ^{n+1}}{\partial \varvec{n}}=0,&\text {on } \Gamma , \end{aligned} \right. \end{aligned}$$
(4.15)

where \(s(\phi ^{n+1}-\phi ^n)/\epsilon ^2\) is a stabilization term ensuring that the discrete energy is decreasing. Hence, at each time step, we need to solve a 3-D system of two coupled elliptic equations in \(\{\mu ^{n+1},\phi ^{n+1}\}\). The above scheme can easily be extended to second order (cf. [25]).
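For instance, writing \(\varvec{u}=(\phi ^{n+1},\mu ^{n+1})^T\), one admissible way (our own bookkeeping, not a unique choice) to cast (4.15) into the constant-coefficient coupled form \(A\varvec{u}- B\Delta \varvec{u}= \varvec{f}\) treated in Sect. 2 is

$$\begin{aligned} A = \begin{pmatrix} 1/\delta t &{} 0 \\ -s/\epsilon ^2 &{} 1 \end{pmatrix}, \quad B = \begin{pmatrix} 0 &{} M \\ -1 &{} 0 \end{pmatrix}, \quad \varvec{f}= \begin{pmatrix} \phi ^n/\delta t \\ f(\phi ^n)/\epsilon ^2 - s\phi ^n/\epsilon ^2 \end{pmatrix}, \end{aligned}$$

so that each time step amounts to a single call of the coupled solver with \(L=2\).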

In Fig. 7, we show an example with the following initial condition (kissing balls):

$$\begin{aligned} \phi (x,y,z,0) = \tanh \frac{\sqrt{(x-\frac{1}{3})^2+y^2+z^2}-\frac{1}{3}}{\epsilon } +\tanh \frac{\sqrt{(x+\frac{1}{3})^2+y^2+z^2}-\frac{1}{3}}{\epsilon } - 1, \end{aligned}$$
(4.16)

which takes 20000 steps to reach the equilibrium. In this example, the run times are

$$\begin{aligned} T_i = 0.0403398, \quad T_g = 650.494, \quad T_c = 42032.6 \quad \Rightarrow \text {speedup} \approx 65, \end{aligned}$$
(4.17)

which is consistent with Table 6. In other words, the computing times are nearly twelve hours on the CPU but less than eleven minutes on the GPU.

5 Conclusion

In this paper, we have provided, for the first time, a practical, unified framework for GPU-accelerated spectral methods for systems of coupled elliptic equations in both 2-D and 3-D domains. Two of the most popular variants of spectral methods, the spectral collocation method and the spectral-Galerkin method, were discussed and compared.

For both the model equations and the time-dependent applications, an order-of-magnitude speedup was achieved; the computing time was reduced from hours to minutes with the GPU-accelerated spectral methods. Furthermore, we compared the two methods, with particular focus on their accuracies, initialization costs, computational costs, and speedups. It is clear that GPGPU and the proposed spectral methods are an excellent match. We conclude that spectral methods can be made very fast by utilizing the GPU and new CUDA tools, and are expected to be capable of delivering real-time simulations for many practical problems.