1 Introduction

The finite element method (FEM) is one of the most extensively used numerical methods for solving real-world problems governed by ordinary/partial differential equations. The popularity of FEM stems primarily from its ability to handle complex geometries, its high accuracy, and its applicability to a wide range of problems. However, FEM can be computationally expensive for complex real-life problems [3, 19, 22] that require a large number of degrees of freedom (DOFs) to obtain the desired solution. Although computational resources have grown exponentially, the computational cost of FEM remains the main bottleneck for many large-scale problems.

In the literature, high performance computing (HPC) techniques have been used to handle the expensive computation required in FEM [41, 45]. Recently, Graphics Processing Unit (GPU)-based computation has become immensely popular for HPC implementations. The importance of GPUs for HPC applications can be gauged from the fact that many supercomputers in the Top500 list have GPUs on each of their nodes [42]. This is because a GPU is a massively threaded many-core processor architecture that houses a large number of computational units for efficiently handling parallel workloads.

HPC techniques are well suited to FEM since much of the computation involved can be done in parallel. However, FEM remains computationally expensive because of the various steps it involves. In FEM, a mesh is first created by dividing the problem domain into a number of polygon/polyhedron-shaped entities known as elements. The governing field equations are recast into an integral form (called the “weak form”), and the subsequent approximation of the primary variable over each element leads to the elemental stiffness matrices and force vectors. The elemental matrices are then assembled into a global matrix using the mesh connectivity. After application of suitable boundary conditions, the assembled system is solved using a suitable direct or iterative linear solver to obtain the values of the unknown field variables. Every step of the FEM procedure can incur significant computational overhead depending on the type of problem being solved. Previous attempts to accelerate each step of FEM on GPU have achieved significant performance gains over single- and multi-core CPUs. Speedups of several folds have been observed for the elemental matrix evaluation [25, 31, 38, 46] and its assembly [10, 16, 20, 34, 37, 39] into the global matrix. However, the elemental matrix evaluation and assembly consume a smaller fraction of the total computational time than the solution of the system of equations. Consequently, the acceleration of linear solvers on GPU has received more attention [2, 5, 24, 35].

Iterative solvers are generally preferred for the solution of large systems of equations. In addition to being memory efficient, they expose abundant parallelism, which makes them suitable for GPUs. The main computational components of an iterative solver are the sparse matrix-vector product (SpMV), vector dot product, scalar-vector product, and vector-vector addition/subtraction. Among these, SpMV is the most computationally expensive operation [44]. Since efficient parallel implementations of operations like the vector dot product are easily achieved, the performance of an iterative solver depends directly on the performance of the SpMV operation. An SpMV operation multiplies a sparse matrix, stored in a specialized format (such as CSR, COO, or ELL [7]), with a vector. These sparse storage formats reduce the storage requirement of a sparse matrix. However, they introduce irregular memory access patterns that prevent a GPU device from reaching its true computational performance. In spite of these challenges, the SpMV operation has benefited greatly from GPU acceleration [4, 7, 15].
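To make the access-pattern issue concrete, the listing below gives a minimal CUDA sketch of a scalar CSR SpMV kernel with one thread per matrix row; the array names (rowPtr, colIdx, vals) are illustrative assumptions and are not tied to any particular library.

```cuda
// Minimal sketch of a scalar CSR SpMV kernel, one thread per matrix row.
// The array names (rowPtr, colIdx, vals) are illustrative assumptions.
__global__ void spmv_csr(int nRows, const int* rowPtr, const int* colIdx,
                         const double* vals, const double* y, double* p)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < nRows) {
        double sum = 0.0;
        // The nonzeros of a row are contiguous, but the colIdx-driven reads of
        // y are irregular, which limits the achievable memory bandwidth.
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
            sum += vals[k] * y[colIdx[k]];
        p[row] = sum;
    }
}
```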

Another way to improve the performance of an iterative solver is to replace the SpMV operation of a sparse matrix with a vector by a matrix-free approach. In this approach, the multiplication is performed at the level of the smaller dense constituent matrices. Since the elemental matrices are dense and of the same size, the matrix-free approach provides a finer level of parallelism along with regular memory access patterns suitable for GPUs. The matrix-free approach was first introduced in [18] for low-memory vector machines and was primarily used for solving large FEM problems on microprocessors with limited memory [9]. The recent revival of interest in the matrix-free approach can be attributed to the introduction of massively parallel many-core architectures. Since the advent of CUDA in 2006, the matrix-free approach has been actively pursued by many researchers. GPU-based matrix-free FEM solvers have been developed for applications such as elasticity [27], heat conduction [21], weather prediction [29], fluid flow [14], and topology optimization [33], among many others. The elemental matrices in a matrix-free solver can either be precomputed for all elements in the mesh and stored as dense matrices or be recalculated on-the-fly [6, 23] during the matrix-vector product. As shown in [34], the approach that recalculates elemental matrices becomes highly compute bound on GPUs, whereas the best performance for low-order FEM is achieved by the local matrix approach. The local matrix approach is also found to perform better than assembly-based solvers [26, 34]. Overall, the superior performance of the matrix-free approach can be ascribed to reduced memory transfer, better access patterns, and fine-grained parallelism. However, the performance of the matrix-free approach on GPU is still limited by memory bandwidth.

To the best of the authors’ knowledge, existing studies implementing the matrix-free approach use full elemental matrices for the matrix-vector product evaluation. The elemental matrices obtained in FEM are symmetric for most problems. Implementing the matrix-vector product using only the symmetric part of the elemental matrices can significantly reduce the storage requirement as well as the data transfer. On memory-bound architectures like GPUs, a reduction in data requirement is expected to improve the performance of a kernel substantially. So far, the matrix-free approach has not been implemented on GPU using only the symmetric part of the elemental matrices. Therefore, the main contributions of this paper are as follows.

  1. A novel matrix-free solver for FEM is developed which uses only the symmetric part of the elemental matrices.

  2. In order to optimize the data access pattern, a unique data structure is developed which ensures coalesced memory access for efficient GPU implementation while storing only the upper triangular part of the elemental matrices.

  3. A comparative analysis of the proposed solver with the existing matrix-free methods is presented on two test problems using linear quadrilateral elements over unstructured meshes generated through an FEM software package.

The paper is organized as follows. Section 2 presents the most common existing matrix-free solvers for FEM and discusses their GPU implementations. In Sect. 3, the proposed matrix-free method is described along with its data structure and access pattern. The performance evaluation and comparison are presented in Sect. 4. Section 5 concludes the paper with the scope of future work.

2 Background

2.1 Matrix-free FEM

The finite element discretization produces the global system of algebraic equations

$$\begin{aligned} {\mathbf {K}}{\mathbf {U}}={\mathbf {F}}, \end{aligned}$$
(1)

where \({\mathbf {K}}\) is the sparse global stiffness matrix, \({\mathbf {U}}\) is the unknown displacement vector and \({\mathbf {F}}\) is the global extended force vector. The global stiffness matrix and nodal force vector are assembled from the elemental matrices as

$$\begin{aligned} {\mathbf {K}}&=\mathop { {\mathcal {A}}}_{e\in {\mathcal {E}}}{\mathbf {K}}^e, \end{aligned}$$
(2)
$$\begin{aligned} {\mathbf {F}}&=\mathop { {\mathcal {A}}}_{e\in {\mathcal {E}}}{\mathbf {F}}^e, \end{aligned}$$
(3)

where \({\mathbf {K}}^e\) is the elemental stiffness matrix, \({\mathbf {F}}^e\) is the elemental force vector, \({\mathcal {E}}\) is the set of all elements in the mesh and \( {\mathcal {A}}\) is the assembly operator. An iterative approach to solving the system of equations (Eq. (1)) requires the multiplication of the sparse global stiffness matrix \({\mathbf {K}}\) with a given vector \({\mathbf {y}}\) in each iteration. There are three major strategies by which a matrix-free solver can be implemented on a GPU, listed below; a sketch of where this matrix-free product enters an iterative solver is given after the list.

  1. Node-by-Node (NbN)

  2. Degrees-of-Freedom-by-Degrees-of-Freedom (DbD)

  3. Element-by-Element (EbE)
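The following host-side sketch shows where the matrix-free product enters a plain (unpreconditioned) conjugate gradient iteration. The operator type and the serial vector helper are assumptions used only for illustration; in the GPU solver they correspond to the matrix-free kernel(s) and to BLAS-1 style device kernels.

```cuda
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side sketch of an unpreconditioned CG solve in which the usual SpMV is
// replaced by a matrix-free operator (assumed type MatFreeApply).
using Vec = std::vector<double>;
using MatFreeApply = void (*)(const Vec& x, Vec& Kx);   // computes Kx = K*x without forming K

static double dot(const Vec& a, const Vec& b)
{
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Solves K*U = F; U is passed in as the initial guess (typically zero).
void cg_matrixfree(MatFreeApply applyK, const Vec& F, Vec& U, double tol, int maxIter)
{
    const std::size_t n = F.size();
    Vec r(n), p(n), Kp(n);
    applyK(U, Kp);                                        // r = F - K*U
    for (std::size_t i = 0; i < n; ++i) { r[i] = F[i] - Kp[i]; p[i] = r[i]; }
    double rr = dot(r, r);
    for (int it = 0; it < maxIter && std::sqrt(rr) > tol; ++it) {
        applyK(p, Kp);                                    // the only matrix-dependent step
        double alpha = rr / dot(p, Kp);
        for (std::size_t i = 0; i < n; ++i) { U[i] += alpha * p[i]; r[i] -= alpha * Kp[i]; }
        double rrNew = dot(r, r);
        double beta  = rrNew / rr;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rrNew;
    }
}
```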

2.2 NbN strategy

In the NbN strategy, the matrix-vector product is computed by moving through each node of the mesh. Every node has its corresponding rows in the global stiffness matrix. Each such row is multiplied with the given vector and the result is accumulated into an array. Since the global stiffness matrix is not constructed explicitly, each row corresponding to a node needs to be assembled before multiplication. To generate a row of the global stiffness matrix, contributions from all neighboring elements are needed. In practice, the required entries of the elemental matrices of the neighboring elements are first multiplied with the corresponding vector entries and then the result is assembled [8, 27], which is given as

$$\begin{aligned} {\mathbf {p}}^{(n)}=\mathop {\sum }_{e\in {\mathcal {E}}^{(n)}}({\mathbf {K}}^e_n{\mathbf {y}}^e), \end{aligned}$$
(4)

where \({\mathcal {E}}^{(n)}\) is the set of elements connected to node \(n \), \({\mathbf {K}}^e_n\) is the contribution of the elemental stiffness matrix toward node \(n \), \({\mathbf {p}}\) is the resultant vector and \({\mathbf {y}}^e\) is the elemental sub-vector of vector \({\mathbf {y}}\) with which the multiplication has to be done. The NbN strategy for a two-dimensional problem having two DOFs per node (\(N_{dof}\)) is shown in Algorithm 1. Since the computation is performed by moving through each node, the list of elements connected to node n (\({\mathcal {E}}^{(n )}\)) is computed in step 2 from the node connectivity matrix. For each element in the node connectivity, the elemental connectivity matrix (\({\mathbf {E}}^{(e)}\)) is found along with the local position (q) of the node in the element. The elemental stiffness matrix \({\mathbf {K}}^e\) is read on the basis of q. Finally, the required product is calculated in steps 7 and 9 for all DOFs associated with the node.

Algorithm 1 Matrix-vector product computation using the NbN strategy

GPU parallelization is done over the nodes of the mesh. A single thread is assigned to perform the computation for one node. The node connectivity array and the local position array can be reordered so that they are read in a coalesced manner. The elemental stiffness matrices are read from strided locations in the global memory and hence these accesses cannot be coalesced. The elemental connectivity matrix is arranged column-wise to minimize the number of global memory transactions. The access to vector \({\mathbf {y}}\) is also not coalesced, and it is read through the read-only cache.
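A possible CUDA realization of this single-thread-per-node scheme is sketched below. The concrete data layout (a CSR-style node-to-element adjacency, the local position array stored alongside it, column-wise elemental connectivity, and row-major elemental matrices) is an assumption made for illustration and is not claimed to match the original implementation.

```cuda
// Hedged sketch of the NbN strategy: one thread per mesh node, 2D elasticity.
constexpr int NDOF = 2;             // DOFs per node
constexpr int NPE  = 4;             // nodes per 4-noded quadrilateral element
constexpr int EDOF = NPE * NDOF;    // DOFs per element

__global__ void nbn_matvec(int nNodes, int nElems,
                           const int*    nodeElemPtr,   // size nNodes + 1
                           const int*    nodeElemList,  // ids of connected elements
                           const int*    localPos,      // local position q of the node
                           const int*    elemConn,      // [NPE][nElems], column-wise
                           const double* Ke,            // EDOF x EDOF per element
                           const double* y,
                           double*       p)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;       // one thread per node
    if (n >= nNodes) return;

    double acc[NDOF] = {0.0, 0.0};
    for (int s = nodeElemPtr[n]; s < nodeElemPtr[n + 1]; ++s) {
        int e = nodeElemList[s];
        int q = localPos[s];                             // row block of node n in K^e
        const double* K = Ke + (size_t)e * EDOF * EDOF;  // strided, uncoalesced read
        for (int j = 0; j < NPE; ++j) {
            int m = elemConn[j * nElems + e];            // global node of local column j
            for (int c = 0; c < NDOF; ++c) {
                double yv = __ldg(&y[m * NDOF + c]);     // read-only cache, uncoalesced
                for (int r = 0; r < NDOF; ++r)
                    acc[r] += K[(q * NDOF + r) * EDOF + j * NDOF + c] * yv;
            }
        }
    }
    for (int r = 0; r < NDOF; ++r)
        p[n * NDOF + r] = acc[r];                        // independent write, no race
}
```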

It can be observed that each node performs its computation independently and, therefore, the problem of data race conditions [12] does not arise. This is a major advantage of the NbN strategy, since the overhead associated with synchronization mechanisms like coloring can be avoided.

For an unstructured mesh, each node can have a different number of neighboring elements. This leads to an unequal workload distribution among threads. A GPU warp [12] remains active as long as any of its threads is working, which is undesirable for efficient utilization of GPU resources and constitutes the major disadvantage of the GPU-based NbN strategy.

2.3 DbD strategy

The DbD strategy performs the matrix-vector multiplication by moving through each DOF of the system. Here, the computation corresponding to each DOF associated with a node is treated as an independent task. In the DbD strategy, the matrix-vector product is computed as [28]

$$\begin{aligned} {\mathbf {p}}^{(u)} = \mathop {\sum }_{e\in {\mathcal {E}}^{(u)}}{\mathbf {K}}^e(m^{-1}(u),:){\mathbf {y}}^e, \end{aligned}$$
(5)

where \({\mathbf {K}}^e\) is the elemental stiffness matrix, m is the local-to-global mapping and \({\mathcal {E}}^{(u)}\) is the set of elements connected to DOF u. The DbD strategy is implemented in a way similar to Algorithm 1. A single-thread-per-DOF assignment is used to perform the computation. The input data structure remains identical to that of Algorithm 1, but now the same data is read by as many threads as the value of \(N_{dof}\). Each thread performs its own computation and accumulates the result into an array in a coalesced manner. This strategy is also known as the Row-by-Row solution method [43].

The GPU-based DbD strategy has a finer level of granularity than the NbN strategy. However, the input data requirement remains the same as that of the NbN strategy. Moreover, since the same data is required by all the threads associated with a particular node, it can either be read redundantly from the global memory or shared among threads using the shared memory. The limited size of the shared memory restricts its use to a few cases, and generally the data is read redundantly from the global memory. The DbD strategy also suffers from the same load imbalance problem as the NbN strategy.
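A corresponding CUDA sketch of the DbD strategy is given below; it reuses the constants and the assumed data layout of the NbN sketch above, with the only change being the thread-to-work mapping (one thread per global DOF instead of one thread per node).

```cuda
// Hedged sketch of the DbD strategy: one thread per global DOF. The NDOF
// threads of a node re-read (broadcast) the same connectivity and K^e data.
__global__ void dbd_matvec(int nDofs, int nElems,
                           const int*    nodeElemPtr,
                           const int*    nodeElemList,
                           const int*    localPos,
                           const int*    elemConn,
                           const double* Ke,
                           const double* y,
                           double*       p)
{
    int u = blockIdx.x * blockDim.x + threadIdx.x;       // global DOF of this thread
    if (u >= nDofs) return;
    int n = u / NDOF;                                    // owning node
    int r = u % NDOF;                                    // row within the nodal block

    double acc = 0.0;
    for (int s = nodeElemPtr[n]; s < nodeElemPtr[n + 1]; ++s) {
        int e = nodeElemList[s];
        int q = localPos[s];
        const double* K = Ke + (size_t)e * EDOF * EDOF;
        for (int j = 0; j < NPE; ++j) {
            int m = elemConn[j * nElems + e];
            for (int c = 0; c < NDOF; ++c)               // only row (q*NDOF + r) is used
                acc += K[(q * NDOF + r) * EDOF + j * NDOF + c] * __ldg(&y[m * NDOF + c]);
        }
    }
    p[u] = acc;                                          // coalesced write, no race
}
```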

2.4 EbE strategy

In the EbE strategy, the computation of the matrix-vector product takes place at the elemental level. The obtained results are then assembled to get the final resultant vector. This can be expressed as

$$\begin{aligned} {\mathbf {p}} = \mathop { {\mathcal {A}}}_{e\in {\mathcal {E}}}({\mathbf {K}}^e{\mathbf {y}}^e) \end{aligned}$$
(6)

where \({\mathbf {K}}^e\) is the elemental stiffness matrix, \({\mathbf {y}}^e\) is the multiplying vector transformed to the elemental level, \( {\mathcal {A}}\) is the assembly operator, \({\mathcal {E}}\) is the set of all elements in the mesh and \({\mathbf {p}}\) is the resultant vector.

Algorithm 2 Matrix-vector product computation using the EbE strategy

Algorithm 2 shows the implementation of the EbE strategy. The multiplication of the elemental stiffness matrix with vector \({\mathbf {y}}\) is performed by using the local-to-global mapping \({\mathcal {D}}(e ,j)\) in step 4, and the result is stored in the vector val in step 5. The resultant vector p is obtained by assembling val in step 9; a serial sketch of this procedure is given below.
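Since the algorithm listing is reproduced above only as a figure, the following serial C++ sketch restates the EbE product of Eq. (6); the flattened array layout and the name of the local-to-global DOF map D are assumptions made for illustration.

```cuda
#include <cstddef>
#include <vector>

// Serial sketch of Algorithm 2 (the EbE product of Eq. (6)).
// EDOF denotes the number of DOFs per element.
void ebe_matvec_serial(int nElems, int EDOF,
                       const double* Ke,   // nElems x EDOF x EDOF, row-major
                       const int*    D,    // nElems x EDOF local-to-global DOF map
                       const double* y,
                       double*       p)    // must be zero-initialized by the caller
{
    std::vector<double> val(EDOF);
    for (int e = 0; e < nElems; ++e) {
        const double* K = Ke + (std::size_t)e * EDOF * EDOF;
        const int*    d = D  + (std::size_t)e * EDOF;
        for (int i = 0; i < EDOF; ++i) {              // elemental product (steps 4-5)
            double sum = 0.0;
            for (int j = 0; j < EDOF; ++j)
                sum += K[i * EDOF + j] * y[d[j]];
            val[i] = sum;
        }
        for (int i = 0; i < EDOF; ++i)                // assembly into p (step 9)
            p[d[i]] += val[i];
    }
}
```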

In the GPU implementation, the computation for each element is performed in parallel. There are three prominent ways of distributing the workload among threads, which are as follows.

  1. Single thread per element: In the single-thread-per-element approach [21], one thread is responsible for reading the input data, computing the elemental matrix-vector product, and accumulating the calculated values into the resultant vector. This approach is the simplest to implement. However, each element gets only the amount of on-chip memory (shared memory and registers) corresponding to a single thread. Therefore, this approach suffers from poor utilization of the fast on-chip memories.

  2. Single thread per node: The single-thread-per-node approach allocates as many threads to an element as the number of nodes. Each thread performs the computation for all DOFs associated with its node.

  3. Single thread per DOF: The finest level of granularity is achieved in the single-thread-per-DOF approach. Here, a number of threads equal to the DOFs associated with an element is allocated [27, 32]. The elemental matrix-vector product is decomposed into several inner products corresponding to the rows of the matrix, and each thread is assigned the computation of one inner product. This approach also provides the highest amount of on-chip memory per element.

After the elemental matrix-vector product is obtained, it needs to be assembled into the final global vector as shown in Algorithm 2. Each entry in the global vector corresponds to a DOF of the system, and each boundary DOF is shared among multiple elements of the mesh. During the parallel assembly of the elemental resultant vectors, multiple elements tend to write their calculated values to the same location in the global vector simultaneously. Such memory operations lead to the problem of data race conditions. In the EbE strategy, race conditions must be addressed by the use of a suitable synchronization mechanism like coloring, atomics, or a separate assembly kernel [27].

The EbE strategy with a single-thread-per-DOF allocation is found to have the best performance [27]. Apart from providing the finest level of parallelism, this strategy has a simple data access pattern, balanced load distribution, and better utilization of on-chip memory. Along with the coloring method to handle data race conditions, this strategy is used as a reference in this work.
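A hedged CUDA sketch of this reference kernel, launched once per element color, is given below; the packed per-color element list and the column-wise DOF map are layout assumptions made for illustration.

```cuda
// Hedged sketch of the reference EbE kernel: one thread per elemental DOF,
// one launch per element color. Elements of a color share no DOFs, so the
// accumulation into p needs no atomics.
__global__ void ebe_matvec_color(int nColorElems, int nElems,
                                 const int*    colorElems,  // element ids of this color
                                 const int*    elemDofMap,  // [EDOF][nElems], column-wise
                                 const double* Ke,          // EDOF x EDOF per element
                                 const double* y,
                                 double*       p)
{
    constexpr int EDOF = 8;                // NPE * NDOF for the 4-noded quad, 2 DOFs/node
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int le  = tid / EDOF;                  // element index within this color
    int row = tid % EDOF;                  // row of K^e handled by this thread
    if (le >= nColorElems) return;

    int e = colorElems[le];
    const double* K = Ke + (size_t)e * EDOF * EDOF;

    double sum = 0.0;
    for (int j = 0; j < EDOF; ++j) {
        int g = elemDofMap[j * nElems + e];              // global DOF of local column j
        sum += K[row * EDOF + j] * __ldg(&y[g]);
    }
    p[elemDofMap[row * nElems + e]] += sum;              // race-free within one color
}
```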

3 Proposed matrix-free strategy

3.1 Matrix-free solver exploiting symmetry of elemental matrices

The major steps involved in the GPU-based computation of the matrix-vector product \({\mathbf {K}}{\mathbf {y}}\) in a matrix-free manner are shown in Fig. 1. The matrix-vector product is computed in three steps: (1) transformation of the multiplying vector \({\mathbf {y}}\) to the elemental vectors \({\mathbf {y}}^e\), (2) computation of the dense matrix-vector products with the elemental matrices, and (3) assembly of the computed results. Steps 1 and 3 primarily involve data scatter and gather type operations with little arithmetic load. In step 2, the dense matrix-vector product is computed for all elements, which makes it computationally expensive compared to the other two steps. An efficient implementation of step 2 is therefore crucial for the performance of the matrix-free solver.

Fig. 1 Major steps involved in the matrix-free computation of elemental matrix-vector product

In all existing matrix-free strategies, the computation in step 2 of Fig. 1 is performed using the full elemental stiffness matrices. The proposed strategy exploits the fact that the elemental matrices in linear FEM are symmetric in most cases. This property can be used to perform the elemental dense matrix-vector product using only the lower or upper triangular part of the matrix. The matrix-vector product in the proposed strategy is implemented using only the symmetric part of the elemental matrices as

$$\begin{aligned} {\mathbf {p}} = \mathop { {\mathcal {A}}}_{e\in {\mathcal {E}}}({\mathbf {K}}^{e}_{sym}{\mathbf {y}}^e), \end{aligned}$$
(7)

where \({\mathbf {K}}^e_{sym}\) is the symmetric part of the elemental stiffness matrix. The proposed matrix-free strategy for FEM is referred to as \(EbE_{sym}\) and is represented graphically in Fig. 2. It can be seen that the dense matrix-vector product required in the matrix-free solver is replaced by a dense symmetric matrix-vector product (SYMV) which requires only the symmetric part of the elemental matrices. In the proposed strategy, all the steps mentioned in Fig. 1 are implemented as a single computational kernel, with the coloring method used to avoid data race conditions.

Fig. 2 Illustration of the proposed \(EbE_{sym}\) strategy

The computation of a SYMV operation can be performed by following Algorithm 3. The multiplication of each row with vector \({\mathbf {y}}^e\) is performed by looping over the total number of rows \(K_{size}\), where the computation is performed first (step 3) for the upper triangular part of the matrix. In step 7 of Algorithm 3, the computation for the missing symmetric part (the lower triangular part) is performed, where the indices k and i into the matrix \({\mathbf {K}}^e\) always refer to values in the upper triangular part. This shows that the SYMV computation can be implemented by storing only the upper or the lower triangular part of the symmetric matrix; a serial sketch is given after the algorithm listing.

Algorithm 3 Computation of the dense symmetric matrix-vector product (SYMV)
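As the listing appears above only as a figure, the following serial sketch restates the SYMV of Algorithm 3 under the assumption that \({\mathbf {K}}^e\) is stored as a packed, row-major upper triangle; both loops read exclusively from this upper-triangular storage.

```cuda
// Serial sketch of the SYMV in Algorithm 3: pe = K^e_sym * ye using only the
// upper-triangular entries of K^e. The packed, row-major upper-triangular
// storage (offset of row i = i*Ksize - i*(i-1)/2) is an assumption.
inline int upperIdx(int i, int k, int Ksize)   // entry (i, k) with k >= i
{
    return i * Ksize - i * (i - 1) / 2 + (k - i);
}

void symv_upper(int Ksize, const double* KeUpper, const double* ye, double* pe)
{
    for (int i = 0; i < Ksize; ++i) {
        double sum = 0.0;
        for (int k = i; k < Ksize; ++k)        // stored upper part: row i, columns k >= i
            sum += KeUpper[upperIdx(i, k, Ksize)] * ye[k];
        for (int k = 0; k < i; ++k)            // missing lower part: K(i,k) = K(k,i),
            sum += KeUpper[upperIdx(k, i, Ksize)] * ye[k];  // strided access into the upper part
        pe[i] = sum;
    }
}
```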

For the GPU implementation, the symmetric part of the matrix can be stored in any suitable format that facilitates a uniform memory access pattern during computation. However, the same storage format may not be suitable for the computation of the missing symmetric part. As shown in step 7 of Algorithm 3, values of the matrix \({\mathbf {K}}^e\) are accessed from the stored symmetric part in a strided and nonuniform manner. Such an access pattern wastes the memory bandwidth of the GPU and consequently degrades the performance. The optimization of the memory access pattern for a SYMV operation on GPU has been addressed by many authors [1, 11, 30]. These studies discuss strategies to compute the SYMV operation for moderate to large matrices. However, the approaches available in the literature are either not applicable or not optimal for the small matrices (of size less than 50) generally found in low-order FEM. Moreover, in the current work, the computation has to be performed in a batch over millions of elements. Since the matrix-vector product has very low arithmetic intensity, efficient handling of the memory overhead becomes indispensable for a batched implementation of SYMV. For better performance on GPU, it thus becomes extremely important to minimize the data transfer and to use coalesced and localized memory access patterns. The proposed \(EbE_{sym}\) strategy addresses all these issues by adopting a novel data structure which ensures coalesced memory access while storing only the symmetric part of the elemental matrices. In order to obtain the best performance, the \(EbE_{sym}\) strategy makes efficient use of the register cache by using the CUDA shuffle instruction. This not only helps in relaxing the shared memory size restrictions but also avoids data movement to and from the shared memory. A single-thread-per-node assignment similar to the EbE strategy (Sect. 2.4) is used to achieve a balanced workload distribution. Also, each thread performs its task independently, so that no synchronization barrier is required. In order to demonstrate the performance of the proposed strategy, a quadrilateral element with linear basis functions is considered.

Fig. 3 Organization of the elemental stiffness matrix for a 4-noded quadrilateral element with two DOFs per node

3.2 Kernel design and data structure for \(EbE_{sym}\) strategy

In the proposed strategy, the elemental stiffness matrix is divided into a number of sub-matrices as shown in Fig. 3. The figure shows the nonzero entries in the upper triangular part of the elemental stiffness matrix for a 4-noded quadrilateral element with \(N_{dof}=2\). Each node is associated with as many rows and columns of the matrix as the value of \(N_{dof}\). The size of the sub-matrices is kept equal to \(N_{dof}\), and each sub-matrix contains the values coupling one node in the rows with one node in the columns. For example, the following sub-matrix contains all the computed values between node 1 in the rows and node 2 in the columns.

$$\begin{aligned} \left[ \begin{array}{cc} \text {D}_1 &{} \text {E}_1\\ * &{} * \\ \end{array} \right] \end{aligned}$$

Depending on their position in the matrix, the sub-matrices are categorized into two groups: diagonal and off-diagonal. The diagonal group contains all sub-matrices lying on the diagonal of the elemental matrix, such as

$$\begin{aligned} \left[ \begin{array}{cc} \text {A}_1 &{} *\\ * &{} \text {B}_1 \\ \end{array} \right] , \left[ \begin{array}{cc} \text {A}_2 &{} *\\ * &{} \text {B}_2 \\ \end{array} \right] , \text {etc.} \end{aligned}$$

The off-diagonal group contains sub-matrices that are not unique; these sub-matrices also appear in the symmetric counterpart. In the case of the diagonal group, each sub-matrix is associated with only one node number (the row and column nodes are the same). In the case of the off-diagonal group, two such numbers exist, that is, one associated with the row and the other with the column. Therefore, each sub-matrix is identified with a nodal index of the form \(K^e\{n,m\}\), which represents the sub-matrix at the \(n^{th}\) node in the rows and the \(m^{th}\) node in the columns. The row node number is used to find the global DOF at which the result of the multiplication is stored, whereas the column node number is used to find the global indices of vector \({\mathbf {y}}\) with which the multiplication is performed.

The computation of symmetric matrix-vector product is divided into two stages. In the first stage, the computation for the diagonal group is performed whereas in the second stage, multiplication for the off-diagonal group is done.

Fig. 4 Data access pattern for diagonal group

The sub-matrices in the diagonal group are unique. Since the computation is performed in a node-wise manner, these sub-matrices contribute to the matrix-vector multiplication results of their respective node numbers. The thread assigned to each node reads all the unique entries of its sub-matrix and performs the multiplication with the \({\mathbf {y}}\) vector. The data access pattern for the diagonal group is shown in Fig. 4. It shows four threads accessing the values marked as A and B (also shown in Fig. 3) from four sub-matrices of the diagonal group. Here, in Fig. 4, the superscript represents the element number and the subscript denotes the sub-matrix position in the diagonal group. The other entries of a sub-matrix are accessed in a similar way. In order to achieve coalesced access within a warp, the data for the other elements are stored side by side. Once these values are read, they are multiplied by the given vector and the result is stored in the shared memory. Here, each thread uses its global node number to read values from vector \({\mathbf {y}}\) through the read-only cache. The read from the \({\mathbf {y}}\) vector is not coalesced.
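The sketch below illustrates this first (diagonal group) stage under these layout assumptions. The labels A and B follow Figs. 3 and 4, while the name of the remaining unique entry of each 2x2 block (C) and the write of the partial result to a global array instead of the shared memory are simplifications made only for illustration.

```cuda
// Hedged sketch of the diagonal-group stage of the EbE_sym kernel: one thread
// per (element, node). The unique entries of each 2x2 diagonal block are kept
// in element-interleaved streams so that consecutive threads load consecutive
// addresses (layout assumption).
__global__ void diag_group_stage(int nElems,
                                 const int*    conn,     // [NPE][nElems], column-wise
                                 const double* diagA,    // K(0,0) of each nodal block
                                 const double* diagB,    // K(1,1) of each nodal block
                                 const double* diagC,    // K(0,1) = K(1,0), stored once
                                 const double* y,
                                 double*       pPartial) // per-thread partial results
{
    constexpr int NDOF = 2, NPE = 4;
    int t = blockIdx.x * blockDim.x + threadIdx.x;       // one thread per element node
    if (t >= nElems * NPE) return;
    int e = t / NPE;                                     // element of this thread
    int q = t % NPE;                                     // local node / diagonal block

    double a = diagA[t];                                 // coalesced: address == thread id
    double b = diagB[t];
    double c = diagC[t];
    int    n = conn[q * nElems + e];                     // global node number
    double y0 = __ldg(&y[n * NDOF + 0]);                 // read-only cache, uncoalesced
    double y1 = __ldg(&y[n * NDOF + 1]);

    // Product of the symmetric 2x2 diagonal block [a c; c b] with (y0, y1);
    // the off-diagonal stage adds its contributions to the same partial sums.
    pPartial[t * NDOF + 0] = a * y0 + c * y1;
    pPartial[t * NDOF + 1] = c * y0 + b * y1;
}
```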

The off-diagonal entries of the symmetric matrix are not unique. The transpose of a sub-matrix in the off-diagonal group is obtained if the row and column nodes are interchanged, as shown in Fig. 3. It can be seen that the sub-matrix located at \(K^e\{1,2\}\) appears in its transposed form at \(K^e\{2,1\}\). This implies that the same sub-matrix can be used to perform the computation for both node 1 and node 2. Similarly, the sub-matrix at position \(K^e\{1,3\}\) can be used for both node 1 and node 3. Thus, for a 4-noded quadrilateral element, the computation for two sub-matrices can be performed simultaneously. The computation for the off-diagonal group is implemented such that each thread is assigned an equal workload. Therefore, the sub-matrices for simultaneous computation must be chosen judiciously. As shown in Fig. 3, the sub-matrices having the same type of enclosure can be processed at the same time. If chosen otherwise, some threads can remain idle while the others still have work to do.

Since the computation of the matrix-vector product uses only the symmetric part of the matrix, the sub-matrices in the missing part (the lower triangular part in Fig. 3) must be obtained separately or shared between two threads. Due to the limited size of the shared memory, the values of these sub-matrices are read redundantly from the global memory. However, the data is arranged such that the read results in a broadcast. Although a broadcast from the global memory is slower than one from the shared memory, it has a lower overhead than reading the values separately. The data access pattern for the off-diagonal entries is shown in Fig. 5. Here, D denotes the corresponding value in the sub-matrices with the same type of enclosure (refer to Fig. 3), in which the subscript indicates the sub-matrix within an element and the superscript indicates the element number. It can be seen that threads \(T_1\) and \(T_2\), assigned to nodes 1 and 2, read the same \(D^1_1\) value, whereas threads \(T_3\) and \(T_4\) read the \(D^1_2\) value, which is required to perform the computation for nodes 3 and 4. The F values are stored and read in a similar way, except that the values are now required by a different set of threads. The data for all elements are kept beside each other to enable coalesced access within a warp.

Fig. 5 Data access pattern for off-diagonal group

Once the data is read, it is multiplied with the corresponding values of the vector \({\mathbf {y}}^e\). The \({\mathbf {y}}^e\) vector is extracted from the \({\mathbf {y}}\) vector by using the column indices of the stiffness matrix entries. In FEM, the column indices of an elemental stiffness matrix can be obtained from the global node numbers and \(N_{dof}\). In particular, the column node numbers of each sub-matrix can be used to obtain the column indices of its entries. It can be observed from Fig. 3 that the threads working on a sub-matrix hold either the row node number or the column node number of that sub-matrix. The row node number becomes the column node number of a sub-matrix once it is transposed. Thus, the global node numbers of two threads can be interchanged to obtain the column node number of the sub-matrix. This is achieved in the proposed strategy by using the warp-shuffle instruction. By using the warp-shuffle feature, the proposed strategy avoids both shared memory usage and additional global memory accesses. The warp-shuffle feature is faster than the shared memory and leads to better utilization of the register cache [20].
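A minimal sketch of this exchange with the CUDA warp-shuffle intrinsics is shown below; the XOR pairing of neighbouring lanes is an assumption chosen for illustration.

```cuda
// Hedged sketch of the node-number exchange via warp shuffle. Paired threads
// swap their global (row) node numbers so that each obtains the column node of
// the transposed off-diagonal sub-matrix without touching shared or global
// memory. The pairing pattern (neighbouring lanes) is an assumption.
__device__ int exchangeNodeNumber(int myRowNode)
{
    unsigned mask = __activemask();                  // lanes participating in the swap
    return __shfl_xor_sync(mask, myRowNode, 1);      // partner's row node = my column node
}
```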

4 Results and discussion

The efficiency and performance of the proposed \(EbE_{sym}\) strategy are evaluated by solving the elasticity and steady-state heat conduction equations in two dimensions (2D). The elasticity equation is solved over a cantilever beam and an L-shaped beam, and the steady-state heat equation is solved over a plate with multiple holes. Further, the performance of the proposed strategy is compared with the existing matrix-free strategies discussed in Sect. 2. In all the numerical problems, the finite element discretization produces a symmetric positive-definite system of equations. Since the conjugate gradient (CG) solver is the most efficient and widely used iterative solver for symmetric positive-definite systems [36, 40], it is chosen as the solver in this work. It is important to note that the proposed strategy is equally applicable to other iterative solvers, including multigrid [17]. The coloring method is used to handle data race conditions with the EbE and \(EbE_{sym}\) strategies, whereas no data race condition arises with the NbN and DbD strategies.

The geometry of the cantilever beam and L-shaped beam problems is relatively simple, and hence a structured mesh is generated through an FEM software package, ABAQUS. However, the data structure generated by ABAQUS is that of an unstructured mesh, and the same is used in this work. All the strategies have been implemented by treating the mesh as unstructured and do not exploit any simplification of the mesh to alter the performance. The plate with multiple holes is solved with an unstructured mesh, which is also generated through ABAQUS.

The hardware used consists of an NVIDIA Tesla K40 GPU and an Intel Xeon (R) E5-2650 CPU. The CPU has 12 physical cores clocked at 2.2 GHz, and the GPU has 2880 cores clocked at 745 MHz. CUDA runtime version 9.2 is used. All the numerical results are obtained using double precision floating point arithmetic.

4.1 Elasticity problem

The following elasticity equation is considered over the domain \(\varOmega \),

$$\begin{aligned} \varvec{\nabla } \cdot \varvec{\sigma } + {\varvec{b}}&= 0, \quad \forall ~{\varvec{x}} \in \varOmega , \end{aligned}$$
(8)

which is subjected to the following boundary conditions,

$$\begin{aligned} {\varvec{u}}({\varvec{x}})&= {\varvec{u}}_0, \quad {\varvec{x}} \in \varGamma _u, \\ {\varvec{t}}({\varvec{x}})&= \varvec{{\bar{t}}}, \quad {\varvec{x}} \in \varGamma _t, \end{aligned}$$

where \(\varvec{\sigma }\) is the Cauchy stress tensor, \({\varvec{b}}\) is the body force per unit volume, \({\varvec{u}}\) is the unknown displacement field, \({\varvec{u}}_0\) is the specified displacement on the boundary \(\varGamma _u\) and \(\varvec{{\bar{t}}}\) is the prescribed traction on the boundary \(\varGamma _t\). The elasticity equation is solved for the cantilever beam and the L-shaped beam under the plane stress condition with a linear strain-displacement relation. The problem geometries along with the dimensions and boundary conditions are shown in Figs. 6 and 7, respectively. The material properties are taken as: Young’s modulus (E) = 210 GPa and Poisson’s ratio (\(\nu \)) = 0.3. The domain is discretized using 4-noded quadrilateral elements having two DOFs per node. The problems are solved for different levels of mesh refinement to evaluate the performance at various workloads. Tables 1 and 2 present the meshes with different numbers of elements and the corresponding DOFs for the 2D cantilever beam and the L-shaped beam, respectively.

Fig. 6 A 2D cantilever beam with end load. All dimensions are in meters (m)

Fig. 7 L-shaped beam. All dimensions are in meters (m)

Table 1 Finite element mesh for 2D cantilever beam
Table 2 Finite element mesh for L-shaped beam
Fig. 8 A plate with multiple holes. All dimensions are in meters (m)

4.2 Steady-state heat conduction problem

The following steady-state heat conduction equation is solved over a plate with multiple holes, as shown in Fig. 8:

$$\begin{aligned} \begin{aligned} \nabla \cdot (\varvec{\kappa } \cdot \nabla T({\varvec{x}}))&= f({\varvec{x}}), \quad {\varvec{x}} \in \varOmega , \\ T({\varvec{x}})&=g({\varvec{x}}), \quad {\varvec{x}} \in \varGamma _g, \\ {\varvec{n}}({\varvec{x}}) \cdot \varvec{\kappa } \cdot \nabla T({\varvec{x}})&= 0, \quad {\varvec{x}} \in \varGamma _q, \\ \varGamma _g&= \varGamma _{g1} \cup \varGamma _{g2}, \\ \end{aligned} \end{aligned}$$
(9)

Here, \(T({\varvec{x}})\) is the unknown temperature field, \(f({\varvec{x}}) = 0\), \(g({\varvec{x}}) = 200\) on \(\varGamma _{g2}\) and 10 on \(\varGamma _{g1}\), and \(\varvec{\kappa }\) is the thermal conductivity matrix, which is taken as the identity. Table 3 lists the meshes with various levels of refinement used in this analysis. The domain is discretized with 4-noded quadrilateral elements with a single DOF per node.

Table 3 Finite element mesh for steady-state heat conduction problem

4.3 Performance results

The performance of the \(EbE_{sym}\) strategy is assessed on the basis of the kernel time, arithmetic throughput (GFLOP/s) and memory bandwidth. The kernel time is measured using the CUDA event API, whereas the GFLOP/s and bandwidth are calculated from the metrics reported by the nvprof profiler. Here, the kernel time refers to the execution time of the CUDA kernel in one iteration of the CG solver. For the NbN and DbD strategies, only one kernel is launched per iteration, but for the EbE and \(EbE_{sym}\) strategies a separate kernel is launched for each color. Therefore, the kernel time in the case of the EbE and \(EbE_{sym}\) strategies includes the execution time for all the colors. It is noted that the NbN and DbD strategies are equivalent in the case of the steady-state heat transfer problem, since each node has only one DOF. Hence, results for the DbD strategy are not presented for the heat transfer problem.
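For reference, the timing harness assumed for these measurements can be sketched with the standard CUDA event API as follows; the launch wrapper is a placeholder for the per-iteration kernel launches.

```cuda
#include <cuda_runtime.h>

// Minimal sketch of CUDA-event timing for the reported kernel times.
// launchKernels is a placeholder for the per-iteration kernel launch(es)
// (one launch per color for the EbE and EbE_sym strategies).
template <typename LaunchFn>
float timeKernels(LaunchFn launchKernels)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    launchKernels();                         // all kernel launches of one CG iteration
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```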

Figure 9 shows the comparison of the kernel time of the different matrix-free strategies for the elasticity and heat transfer problems. The NbN strategy takes the highest kernel time in all the test problems. It also has the highest data requirement among all the strategies. With the same data structure, the DbD strategy achieves better timings than the NbN strategy merely by increasing the granularity of the computation. The redundant access of data in the case of the DbD strategy does not seem to incur much overhead, as the values are broadcast to \(N_{dof}\) threads from the global memory. Also, access to the elemental matrix requires fewer transactions than in the NbN strategy, as more threads now access the same matrix. However, in both the NbN and DbD strategies, a gather operation is performed to read the elemental matrices in an uncoalesced manner. The elemental matrices constitute the largest amount of data that a matrix-free solver needs to access. As evident from the lower kernel time of the EbE strategy (Fig. 9) compared to the NbN and DbD strategies, the uncoalesced access to the elemental matrices has a large impact on the performance. In the EbE strategy, the elemental matrices are accessed in a coalesced manner. Apart from the better memory access pattern, the EbE strategy has the finest level of granularity, lower data requirement, and an equally distributed workload on each of the computational threads. With all these characteristics, the EbE strategy overcomes the overhead associated with the handling of race conditions and achieves the least kernel time among the existing matrix-free strategies.

Fig. 9 Comparison of kernel time for test problems

Fig. 10 Speedup achieved by \(EbE_{sym}\) strategy over the other matrix-free strategies

The proposed \(EbE_{sym}\) strategy outperforms all the other strategies in every test problem (refer to Fig. 9). Since the \(EbE_{sym}\) strategy inherits the major characteristics of the EbE strategy, a better performance compared to the NbN and DbD strategies is expected. However, the superior performance compared to the EbE strategy can be mainly attributed to the reduction in data movement due to the use of only the symmetric part of the elemental matrices. In the elasticity problem, 4-noded quadrilateral elements with two DOFs per node are used, which gives an elemental matrix of size \(8 \times 8\). The implementation and optimization of the matrix-vector product for such a small matrix in batch mode is extremely challenging, as it involves a very low arithmetic load compared to the required amount of data movement. The proposed \(EbE_{sym}\) strategy achieves a better kernel time than the EbE strategy due to a unique data structure that ensures a localized and coalesced access pattern while using only the symmetric part of the elemental matrices. The reduction in data requirement helps in maintaining a higher computation-to-data-movement ratio than the EbE strategy, which is favorable for GPU implementation. Moreover, the \(EbE_{sym}\) strategy also achieves better execution times for the heat transfer problem (Fig. 9c), which uses elemental matrices of size \(4 \times 4\).

Fig. 11 GPU memory utilization by various strategies

The speedup obtained by the \(EbE_{sym}\) strategy over the other matrix-free strategies is shown in Fig. 10. In all three problems, a consistent speedup is observed, which suggests that the \(EbE_{sym}\) strategy scales well with increasing problem size. With respect to the NbN strategy, approximately 5\(\times \) speedup is observed for both the cantilever and L-shaped beam problems. In the case of the DbD strategy, approximately 2.8\(\times \) speedup is observed for the elasticity problems. A relatively lower speedup is observed for the heat transfer problem because the smaller elemental matrices incur less memory overhead in the NbN or DbD strategy. The proposed strategy achieves 1.4\(\times \) speedup over the EbE strategy for the elasticity problems and 1.3\(\times \) speedup for the heat transfer problem. The similar speedup in both cases illustrates the suitability of the proposed strategy for extremely small elemental matrices.

Figure 11 shows the amount of GPU memory occupied by the different strategies as a function of the problem size. It can be observed that the proposed \(EbE_{sym}\) matrix-free strategy consumes the least amount of GPU memory. This suggests that a much larger problem can be solved with the proposed strategy on a given GPU card in a shorter amount of time. The NbN and DbD strategies occupy the highest amount of memory in all the numerical problems. This is due to the dependency of these strategies on arrays like the node connectivity and the local position of nodes in each element, in addition to the elemental connectivity and the elemental matrices. The EbE strategy stores only the elemental connectivity and the elemental matrices on the GPU and therefore requires less memory than the NbN and DbD strategies. The least memory consumption of the \(EbE_{sym}\) strategy is due to the storage of only the symmetric part of the elemental matrices. For elemental matrices of size \(8 \times 8\) and \(4 \times 4\), the \(EbE_{sym}\) strategy uses only 36 and 10 entries, respectively, to perform the computation. This leads to 1.7\(\times \) and 1.4\(\times \) reduction in data at the elemental level. Overall, the \(EbE_{sym}\) strategy requires 1.5\(\times \) less memory for the elasticity problems and 1.3\(\times \) less memory for the heat transfer problem (refer to Fig. 11) than the EbE strategy. The ratio of the total number of entries to the number of unique entries in a symmetric matrix is given by

$$\begin{aligned} \frac{2n}{n+1}, \end{aligned}$$
(10)

where n is the size of the matrix. As the size of the matrix increases, this ratio approaches two. Thus, for higher-order finite elements, where the elemental matrix is large, the use of the symmetric part of the matrix can save up to 50% of the data required by the EbE strategy. Furthermore, the performance of the \(EbE_{sym}\) strategy relative to the EbE strategy can be expected to improve further for larger matrix sizes.

Fig. 12 GFLOP/s achieved by various matrix-free strategies

Fig. 13 Memory bandwidth achieved by various matrix-free strategies

A deeper insight into the performance of the various matrix-free strategies can be obtained by looking at the arithmetic throughput and effective memory bandwidth achieved by them. Figure 12 shows the GFLOP/s achieved by the different matrix-free strategies as a function of the mesh size. The arithmetic throughput of the proposed \(EbE_{sym}\) strategy, which has the least kernel time, is found to be the highest in all the test problems. The EbE strategy shows a significant improvement in GFLOP/s compared to the NbN and DbD strategies. Compared to the EbE strategy, 1.5\(\times \) and 1.3\(\times \) better throughput is observed for the \(EbE_{sym}\) strategy in the elasticity and heat transfer problems, respectively. However, the GFLOP/s achieved by all the strategies is found to be much lower than the peak value of the device. The maximum value achieved by the \(EbE_{sym}\) strategy reaches approximately 3% of the peak value of the NVIDIA Tesla K40.

The comparison of the effective memory bandwidth achieved by the various strategies is shown in Fig. 13. The memory bandwidth achieved by all the strategies is found to be on the higher side, indicating the memory-bound nature of the matrix-free solvers. The NbN and DbD strategies achieve relatively lower bandwidth than the EbE strategy. The EbE strategy achieves 212 GB/s for the elasticity problems and 208 GB/s for the heat transfer problem. The proposed \(EbE_{sym}\) strategy shows a small improvement over the EbE strategy and achieves bandwidths of 215 GB/s and 209 GB/s for the elasticity and heat transfer problems, respectively. The maximum bandwidth achieved by the \(EbE_{sym}\) strategy is approximately 74% of the theoretical peak value of the NVIDIA Tesla K40 GPU, which is close to the achievable bandwidth of the device [13]. Since the maximum arithmetic throughput is close to 3% of the theoretical peak throughput, it can be concluded that the performance of the proposed strategy is limited by memory bandwidth.

5 Conclusions

A new GPU-based matrix-free strategy (\(EbE_{sym}\)) for FEM has been proposed. The developed strategy was based on an element-by-element FE solver which replaced the SpMV operation of an iterative solution method by element-level dense matrix-vector products. A new approach to compute the elemental matrix-vector product was developed which used only the symmetric part of the elemental matrices. The performance of the proposed solver was evaluated by solving elasticity and heat transfer problems on unstructured meshes using 4-noded quadrilateral elements, and a comparison was made with the existing GPU-based matrix-free solvers. For the elasticity problems (two DOFs per node), approximately 5\(\times \) speedup was observed over the node-based (NbN), 2.8\(\times \) over the DOF-based (DbD) and 1.4\(\times \) over the element-based (EbE) matrix-free strategies. For the heat conduction problem (single DOF per node), 3\(\times \) speedup over the NbN strategy and 1.3\(\times \) speedup over the EbE strategy were obtained. As a consequence of using the symmetric part of the elemental matrices, the overall memory footprint of the proposed \(EbE_{sym}\) strategy was reduced by 1.5\(\times \) for the elasticity problems and 1.3\(\times \) for the heat conduction problem compared with the state-of-the-art EbE strategy. The obtained results suggest that the proposed strategy can be used to solve larger problems on a given GPU card in less time. The proposed strategy is applicable wherever FEM generates symmetric elemental matrices. In future, the proposed strategy can be applied to various element types, including higher-order and three-dimensional elements, to study its performance and identify its limitations, if any. The outcome of this future work is expected to make the strategy more generic and applicable to a broader class of problems in FEM. Moreover, the proposed strategy can also be used to develop kernels for batched symmetric matrix-vector products with small matrices for general linear algebra applications.