1 Introduction

The availability of modern CAE software has increased thanks to improvements in computer systems and graphics technology. Nowadays, CPUs for personal computers (PCs) deliver approximately 100 Gflop/s (giga floating-point operations per second), and high-performance workstations achieve many hundreds of Gflop/s, a figure that matched the fastest supercomputer only 20 years ago [1]. High-performance computers have made it possible to solve larger mechanical systems and their linear equations. However, research has shown that the numerical factorization time grows explosively as the size of a linear equation increases.

The easiest way to cope with the increase of computing time is to replace the computer system with more powerful CPUs. However, advances in CPU speed are slowing down as technological limits throttle higher performance. The CPU clock speed per core cannot exceed a certain speed related to power consumption and temperature limits, which is called the clock limit. One way to circumvent this limit and improve performance is to utilize additional computing devices such as GPUs. The origins of CPUs and GPUs are different: CPUs were designed to run an operating system (OS) and to manage all programs on the OS, whereas GPUs were designed to render real-time screens and game graphics more naturally and colorfully. Meanwhile, there have been many attempts to graft the high computing performance of GPUs onto numerical computations for general purposes [2]. GPUs generally include hundreds or thousands of cores, which are highly suitable for data-parallel algorithms. GPU devices are used to obtain solutions of the matrices since they have superior floating-point performance compared to CPUs.

There are two kinds of widely used methods to solve the linear equation \(\mathbf{Ax}=\mathbf{b}\). One is the iterative method [3]. It consists of matrix–vector and vector–vector operations and can be mapped onto well-established sparse and dense BLAS routines. In addition, parallelization of the iterative method is quite effective. These advantages fit well on additional computing devices, so several researchers have studied iterative linear equation solvers on GPUs [4, 5]. NVIDIA CUDA provides sparse routines that support iterative methods through the cuSPARSE and cuBLAS libraries [6]. ViennaCL was developed to study the solution of large systems of equations by means of iterative methods with optional preconditioners on various computing devices such as GPUs, MICs and CPUs [7]. These GPU-based iterative solvers have already been widely used in various fields such as electronics [8], mechanics [9–11] and finance [12]. However, iterative solvers become very slow when the matrix condition number is very high [13]. Mechanical dynamics usually handles high-stiffness problems, so the condition numbers of their matrices are extremely high. Therefore, convergence of the iterative methods for mechanical system dynamics is very slow, and it is not easy to find a proper preconditioner. These reasons force us to use the direct method in the field of mechanical system dynamics.

The direct method [14] consists of a finite number of floating-point operations, so it always obtains an exact solution except for singular system matrices. In this research, a newly proposed multifrontal implementation is presented to maximize the utilization of a GPU device. The multifrontal method originated from carrying out the assembly and Gaussian elimination of element matrices at the same time in the finite element method [15, 16]. It has been widely studied and implemented in many large-scale finite element applications on CPUs [17]. The algorithm is relatively complicated compared to the iterative method due to reordering, fill-in and dynamically updatable sparse matrix structures. It also needs frequent copies of data between host and GPU memory spaces. For these reasons, studies on the direct method have been less active, especially on GPUs. Davis applied CUDA acceleration to the CHOLMOD and SPQR algorithms as parts of the SuiteSparse linear algebra package [18, 19]. The two algorithms from SuiteSparse, however, are not appropriate for the mechanical dynamics field, since CHOLMOD is a set of routines only for sparse positive definite matrices and SPQR limits the matrix size. Many mechanical system matrices failed to complete a GPU computation using the SPQR algorithm due to the lack of GPU memory. Therefore, it was necessary to research and implement a linear equation solver using a new direct method for GPUs. The purpose of this research is to implement a GPU-based direct linear equation solver and to optimize it to handle large mechanical system matrices.

This paper is organized as follows. Section 2 briefly summarizes a detailed numerical method for the equations of motion of constrained mechanical systems. Section 3 explains the traditional DFS-based and the proposed BFS-based nested dissection reordering methods. Section 4 first presents the supernodal and multifrontal methods and then introduces the proposed numerical factorization method. Section 5 explains some features of a GPU device and how the proposed implementation can be applied and optimized for a GPU. Sections 6 and 7 present how to determine an optimum maximum block size for the proposed implementation and discuss the results. The mechanical dynamics experiments have been carried out using the proposed method and the DSS routine included in MKL; the performance, memory usage and solution accuracy are discussed. Conclusions are drawn in Sect. 8.

2 Equation of motion

2.1 Constrained mechanical system and integration methods

A constrained mechanical system is often represented as a differential-algebraic equation (DAE). Solving a DAE is more difficult than solving an ordinary differential equation (ODE). There are two methods to solve DAEs [20].

One method is to carry out an explicit numerical integration and to correct the integration variables so that the position-, velocity-, and acceleration-level constraints are satisfied. An advantage of this method is that the system equations are small because the correction is conducted sequentially. However, it also has a disadvantage: the time step for very stiff problems tends to be very small.

The other is the implicit numerical integration method, which can overcome the disadvantage of the explicit method. Kinematic constraints, including their derivatives, and the equations of motion are solved simultaneously. However, a disadvantage of the implicit method is that the system matrix is larger than that of the explicit method. This research investigates an equation solver for such large matrices on many-core GPUs.

2.2 Implicit integration for differential-algebraic equations

The equations of motion for a constrained mechanical system are described as

$$\begin{aligned} \mathbf{v} - \dot{\mathbf{q}} =& \mathbf{0}, \end{aligned}$$
(1)
$$\begin{aligned} \mathbf{F} ( \mathbf{q}, \mathbf{v}, \mathbf{a}, \pmb {\lambda} ) =& \mathbf{0} , \end{aligned}$$
(2)
$$\begin{aligned} \pmb{\varPhi} ( \mathbf{q}, {t} ) =& \mathbf{0}, \end{aligned}$$
(3)

where \(\mathbf{q}\) is the generalized coordinate vector in Euclidean space \(\mathbf{R}^{\mathbf{n}}\), \(\mathbf{v}\) is the generalized velocity vector in \(\mathbf{R}^{\mathbf{n}}\), \(\mathbf{a}\) is the generalized acceleration vector in \(\mathbf{R}^{\mathbf{n}}\), \(\pmb{\lambda}\) is the Lagrange multiplier vector for constraints in \(\mathbf{R}^{\mathbf{m}}\), \(\pmb{\varPhi}\) represents the position level constraint vector in \(\mathbf {R}^{\mathbf{m}}\), and the Jacobian \(\pmb{\varPhi}_{\mathbf{q}} \in\mathbf {R}^{\mathbf{m}\times\mathbf{n}}\) is assumed to have full row-rank. Successive differentiations of Eq. (3) yield velocity and acceleration level constraints,

$$\begin{aligned} \dot{\pmb{\varPhi}} ( \mathbf{q}, \mathbf{v}, {t} ) =& \pmb{ \varPhi}_{\mathbf{q}} \mathbf{v} + \pmb{\varPhi}_{t} = \pmb{ \varPhi}_{\mathbf{q}} \mathbf{v} - \pmb{\nu} = \mathbf{0}, \end{aligned}$$
(4)
$$\begin{aligned} \ddot{\pmb{\varPhi}} ( \mathbf{q}, \mathbf{v}, \mathbf {a}, {t} ) =& \pmb{\varPhi}_{\mathbf{q}} \mathbf{a} + \frac{d}{dt} ( \pmb{\varPhi }_{\mathbf{q}} ) \mathbf{v} + \pmb{\varPhi}_{tt} = \pmb{ \varPhi}_{\mathbf{q}} \mathbf{a} - \pmb{\gamma} = \mathbf{0}, \end{aligned}$$
(5)

where \(\pmb{\nu}=-\pmb{\varPhi}_{t}\) and \(\pmb{\gamma}=- ( \frac {d}{dt} ( \pmb{\varPhi}_{\mathbf{q}} ) \mathbf{v} + \pmb{\varPhi }_{tt} )\). Equations (1) to (5) comprise a system of over-determined differential-algebraic equations (ODAE). An algorithm based on backward differentiation formulas (BDF) to solve ODAE is described as

$$ \mathbf{H(x)=} \left [ \textstyle\begin{array}{c} \mathbf{F(x)}\\ \ddot{\pmb{\varPhi}} \\ \dot{\pmb{\varPhi}} \\ \pmb{\varPhi} \\ \mathbf{U}_{1}^{\text{T}} ( {h'} \mathbf{R}_{1} ) \\ \mathbf{U}_{2}^{\text{T}} ( {h'} \mathbf{R}_{2} ) \end{array}\displaystyle \right ] \mathbf{=} \left [ \textstyle\begin{array}{c} \mathbf{F} ( \mathbf{q}, \mathbf{v}, \mathbf{a}, \pmb{\lambda}) \\ \pmb{\varPhi}_{\mathbf{q}} \mathbf{a} - \pmb{\gamma} \\ \pmb{\varPhi}_{\mathbf{q}} \mathbf{v} - \pmb{\nu} \\ \pmb{\varPhi} ( \mathbf{q}, {t} ) \\ \mathbf{U}_{1}^{\text{T}} ( {h'} \mathbf{a} - \mathbf{v} - \pmb{\zeta}_{1} ) \\ \mathbf{U}_{2}^{\text{T}} ( {h'} \mathbf{v} - \mathbf{q} - \pmb{\zeta}_{2} ) \end{array}\displaystyle \right ] \mathbf{= 0} $$
(6)

where \({h'}=\frac{h}{b_{0}}\), \(\pmb{\zeta}_{1} \equiv\frac{1}{b_{0}} \sum_{i=1}^{k} b_{i} \mathbf{v}_{n-i} \) and \(\pmb{\zeta}_{2} \equiv\frac{1}{b_{0}} \sum_{i=1}^{k} b_{i} \mathbf{q}_{n-i}\), in which \(k\) is the order of integration and the \(b_{i}\) are BDF coefficients. \(\mathbf{x} = \left [ \begin{array}{cccc} \pmb{\lambda}^{\text{T}} & \mathbf{a}^{\text{T}} & \mathbf{v}^{\text{T}} & \mathbf{q}^{\text{T}} \end{array} \right ]^{\text{T}} \) and the columns of \(\mathbf{U}_{i} \in\mathbf{R}^{n \times ( n - m )}\) (\(i = 1, 2\)) constitute bases for the parameter space of the position and velocity level constraints. The matrices \(\mathbf{U}_{i}\) are chosen so that \(\bigl[{\scriptsize\begin{matrix}{} \pmb{\varPhi}_{\mathbf{q}} \cr \mathbf{U}_{i}^{\text{T}} \end{matrix}} \bigr]\) has an inverse. Therefore, the parameter space spanned by the columns of \(\mathbf{U}_{i}\) and the subspace spanned by the columns of \(\pmb{\varPhi}_{\mathbf{q}}^{\text{T}}\) together constitute the entire space \(\mathbf{R}^{n}\).

Equation (6) can be solved since the number of equations equals the number of unknowns. Newton's method can be applied to acquire the solution \(\mathbf{x}\):

$$\begin{aligned} \mathbf{H}_{\mathbf{x}}^{\text{i}} \Delta\mathbf{x}^{\text{i}} =& -\mathbf{H}^{\text{i}} , \end{aligned}$$
(7)
$$\begin{aligned} \mathbf{x}^{\text{i+1}} =& \mathbf{x}^{\text{i}} + \Delta \mathbf{x}^{\text{i}}. \end{aligned}$$
(8)
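For concreteness, the Newton iteration of Eqs. (7) and (8) can be sketched on a toy two-equation system. The residual \(\mathbf{H}\) below is a hypothetical stand-in; in the actual solver, the linear system of Eq. (7) is sparse and is solved by the direct method developed in Sects. 3 to 5.

```cpp
// Toy illustration of Eqs. (7) and (8): Newton's method on a hypothetical
// 2x2 nonlinear system H(x) = 0. The real solver replaces solve2() with
// the sparse multifrontal factorization of Sects. 3-5.
#include <cmath>
#include <cstdio>

void eval_H(const double x[2], double H[2]) {      // residual H(x)
    H[0] = x[0] * x[0] + x[1] - 3.0;
    H[1] = x[0] - x[1] * x[1] + 1.0;
}
void eval_Hx(const double x[2], double J[2][2]) {  // Jacobian H_x
    J[0][0] = 2.0 * x[0]; J[0][1] = 1.0;
    J[1][0] = 1.0;        J[1][1] = -2.0 * x[1];
}
// Solve J * dx = -H by Cramer's rule (fine for a 2x2 toy system).
void solve2(const double J[2][2], const double H[2], double dx[2]) {
    double det = J[0][0] * J[1][1] - J[0][1] * J[1][0];
    dx[0] = (-H[0] * J[1][1] + H[1] * J[0][1]) / det;
    dx[1] = (-H[1] * J[0][0] + H[0] * J[1][0]) / det;
}

int main() {
    double x[2] = { 1.0, 1.0 };                    // initial guess x^0
    for (int i = 0; i < 20; ++i) {
        double H[2], J[2][2], dx[2];
        eval_H(x, H);
        eval_Hx(x, J);
        solve2(J, H, dx);                          // Eq. (7): H_x dx = -H
        x[0] += dx[0]; x[1] += dx[1];              // Eq. (8): x^{i+1} = x^i + dx
        if (std::fabs(dx[0]) + std::fabs(dx[1]) < 1e-12) break;
    }
    std::printf("x = (%.12f, %.12f)\n", x[0], x[1]);
    return 0;
}
```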

3 Nested dissection

Equation (7) is a typical linear equation problem \(\mathbf{Ax}=\mathbf{b}\). It is necessary to factorize the matrix \(\mathbf{A}\) to obtain the solution \(\mathbf{x}\). Four steps are needed to obtain \(\mathbf{x}\) efficiently: ① define a sparse matrix structure, ② reorder the matrix to obtain a permutation vector and symbolically factorize, ③ numerically factorize the matrix, and ④ acquire the solution for a right-hand side by forward and backward substitution. Among the four steps, the third step, numerical factorization, is usually the most time-consuming. The reordering traversal in the second step has considerable impact on the performance of the third step. Therefore, the traditional nested dissection algorithm is reviewed first, and then a new nested dissection algorithm is proposed.

When there are connections among the nodes composing a mechanical FE model, the data structure of the connections is defined as a graph (\(G\)). In the graph data structure, a node is called a vertex (\(V\)) and a connection is called an edge (\(E\)) [21]. If there is a set of vertices that divides the graph into two sub-graphs of roughly equal size, the set of vertices is called a 'separator', and the division operation is defined as 'graph bisection', as shown in Fig. 1a. The graph bisection operation creates a typical binary-tree structure. The bisected sub-graphs (\(G_{1}\), \(G_{2}\)) are independent; however, each sub-graph depends on its parent separator. When the graph bisection operation is applied recursively, the process is defined as 'nested dissection' [22]. Figure 1b expresses a tree and a graph region of nested dissection up to three depths as a representative model for this paper.

Fig. 1
figure 1

Graph bisection and nested dissection

All sub-graphs, including the root, are called separators in this research. The highest one is called the root separator and the lowest ones are called leaf separators. The separators at the same depth are independent, while a separator and its higher-depth separators have dependencies.

3.1 Traditional DFS-based nested dissection

The DFS post-order traversal has traditionally been used to number the binary tree of a graph. The DFS post-order tree traversal proceeds as follows [21].

(1) Move to a child separator until there is no lower child separator.

(2) Mark the separator and move to the other child separators of its parent.

(3) Move to and mark the parent separator once all of its child separators have been visited.

(4) Repeat steps (1) to (3) until there are no more unvisited separators.
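A compact illustration of this numbering is given below, assuming the binary separator tree has already been built; the array layout is hypothetical, and in practice 'METIS_NodeND' performs both the dissection and the ordering.

```cpp
// Minimal sketch: DFS post-order numbering of a binary separator tree.
// For the depth-3 tree of Fig. 1b, the root receives the last number,
// exactly as the seventh separator in Fig. 2a.
#include <cstdio>

struct Separator { int left, right; };   // child indices, -1 for a leaf

void dfs_postorder(const Separator* tree, int node, int& counter, int* order) {
    if (tree[node].left  >= 0) dfs_postorder(tree, tree[node].left,  counter, order);
    if (tree[node].right >= 0) dfs_postorder(tree, tree[node].right, counter, order);
    order[node] = counter++;             // a parent is numbered after both children
}

int main() {
    // Seven separators of a depth-3 tree; index 6 is the root.
    Separator tree[7] = { {-1,-1},{-1,-1},{-1,-1},{-1,-1},{0,1},{2,3},{4,5} };
    int order[7], counter = 0;
    dfs_postorder(tree, 6, counter, order);
    for (int i = 0; i < 7; ++i)
        std::printf("separator %d -> visit %d\n", i, order[i]);
    return 0;
}
```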

The multilevel tree of Fig. 2a shows the visiting numbers of the vertices in Fig. 1b, based on the DFS traversal. Figure 2b depicts the region numbers of Fig. 2a, and Fig. 2c illustrates the relations between the nodes from Fig. 2a and the regions from Fig. 2b in matrix form.

Fig. 2
figure 2

Results of the DFS post-order traversal

When visiting a certain separator, all connected descendants of the separator must already have been visited in the traversal. For example, the seventh separator in Fig. 2a has to be visited after the first to sixth separators. This post-ordering improves memory locality during numerical factorization [23]. A representative implementation of the algorithm is the 'METIS_NodeND' routine of the METIS library [24].

3.2 Proposed BFS-based nested dissection

The nested dissection proposed in this research assigns the numbers based on the BFS reverse-level order traversal. The BFS reverse-level order tree traversal follows the rules below [21].

(1) Move to a child separator until there is no more child separator.

(2) Mark the current separator and move to the other sibling separators.

(3) Move to and mark the parent separator when all sibling separators have been visited.

(4) Repeat steps (1) to (3) until there are no more unvisited separators.

The multilevel tree of Fig. 3a shows the visiting numbers of the vertices in Fig. 1b, based on the BFS traversal. Figure 3b depicts the region numbers of Fig. 3a, and Fig. 3c illustrates the relations between the nodes from Fig. 3a and the regions from Fig. 3b in matrix form.

Fig. 3
figure 3

Results of the BFS reverse-level order

Since a separator is visited after its lower-depth separators, it is guaranteed that its independent sibling separators are located nearby. For example, all separators of the third depth in Fig. 3a must be visited before any of the ninth to twelfth separators, which belong to the second depth.

Although this rule has no effect on the number of non-zeros compared to the traditional DFS-based nested dissection, it allows better parallelism and flexible control of the data size required by an additional computing device. We developed this algorithm by recursively calling the 'METIS_ComputeVertexSeparator' routine of the METIS library [24]. The routine provides the vertex indices of the separator and the two sub-graphs on each call. Therefore, it is not necessary to identify supernodes from a permuted sparse matrix as a post-process. A sketch of the resulting numbering is given below.
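The sketch shows only the numbering step, assuming the recursive 'METIS_ComputeVertexSeparator' calls have already produced the binary tree; the tree layout is hypothetical. A plain reversal of the root-first BFS sequence is used, which preserves the essential property that every separator is numbered after all separators of deeper levels.

```cpp
// Minimal sketch: BFS reverse-level order numbering of the proposed
// nested dissection for a binary separator tree.
#include <cstdio>
#include <queue>
#include <vector>

struct Separator { int left, right; };   // child indices, -1 for a leaf

void bfs_reverse_level(const std::vector<Separator>& tree, int root,
                       std::vector<int>& order) {
    std::vector<int> visit;               // separators in root-to-leaf BFS order
    std::queue<int> q;
    q.push(root);
    while (!q.empty()) {
        int n = q.front(); q.pop();
        visit.push_back(n);
        if (tree[n].left  >= 0) q.push(tree[n].left);
        if (tree[n].right >= 0) q.push(tree[n].right);
    }
    // Reverse the BFS sequence: the deepest separators are numbered first,
    // so independent separators of the same depth receive contiguous numbers.
    order.assign(tree.size(), -1);
    int counter = 0;
    for (int i = (int)visit.size() - 1; i >= 0; --i)
        order[visit[i]] = counter++;
}

int main() {
    std::vector<Separator> tree = { {-1,-1},{-1,-1},{-1,-1},{-1,-1},{0,1},{2,3},{4,5} };
    std::vector<int> order;
    bfs_reverse_level(tree, 6, order);    // index 6 is the root separator
    for (size_t i = 0; i < tree.size(); ++i)
        std::printf("separator %zu -> visit %d\n", i, order[i]);
    return 0;
}
```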

4 Numerical factorization

This section reviews the conventional supernodal and multifrontal methods and presents the benefits of the proposed method, based on the BFS traversal method.

4.1 The supernodal methods

The term 'supernodal method' commonly refers to the left- and right-looking methods. Both methods consist of two sorts of operations. One is to factorize the diagonal block of a separator and then forward- and back-substitute the corresponding row blocks of the separator, which is defined as the 'variable factor' in this research. The other is to update a parent separator from its descendants, which is defined as the 'variable update' in this research [25].

The left-looking methods start by applying the variable updates from all descendant separators in the elimination tree to a separator before factorizing it; they delay the variable updates as long as possible. In contrast, the right-looking methods factorize a local separator first and then perform the variable updates for the ancestors of the local separator as early as possible. Figure 4 depicts the left- and right-looking methods from the perspectives of a tree and a matrix.

Fig. 4
figure 4

Data access pattern for left- and right-looking methods

The supernodal methods build a computational structure by performing a symbolic factorization before the actual numerical factorization. Since fixed memory locations are used for the variable updates, the supernodal methods use less memory than the multifrontal methods do.

4.2 The multifrontal methods

The computational sequence of the multifrontal method is determined by an assembly tree structure. The actual computation is performed by combining adjacent columns using a supernodal technique. The multifrontal method is based on successive Gaussian elimination of small dense matrices called frontal matrices. The small dense matrices in a factorization process act as the vertices of the assembly tree.

In Fig. 5a, the adjacent (1, 2), (3, 4), (5, 6) nodes are combined by the supernodal technique and are defined as the ①, ② and ③ separators, respectively. The figure also shows the corresponding frontal matrices and their relationships. A frontal matrix can be decomposed into two blocks, as shown in Fig. 5b: a factor block consisting of the variables being eliminated, and a contribution block composed of the variables being updated in the frontal matrix. Once the numerical factorization of the children frontal matrices is completed, the updated contribution blocks are merged into their parent frontal matrices [26]. Therefore, the frontal factorization algorithm has the constraint that parent separators can be factorized only after all children separators have been factorized. For example, the frontal matrices of the ① and ② separators in Fig. 5a can be factorized at the same time, and then the updated contribution blocks can be merged into the parent separator ③. Due to this constraint, additional temporary storage with a stack structure is required to save the contribution blocks of children separators during a frontal numerical factorization [25].

Fig. 5
figure 5

A sample sparse matrix, its assembly tree and three memory spaces

The multifrontal method requires three kinds of memory spaces during a numerical factorization. The first space is used to store factored blocks; the second space saves the contribution blocks in a stack structure; the third space is needed to operate on the current frontal matrix [27]. The first space grows continuously during a factorization process and the third space is reused throughout the process. Since the stack size depends on the structure of the assembly tree, it is not easy to estimate the exact size of the second space [25]. Figures 6a and 6b illustrate the assembly tree with its frontal matrices for Fig. 2a and the memory transitions as the factorization steps proceed. The multifrontal factorization steps are as follows:

(1) Create a space to save the frontal matrix of the current separator.

(2) Numerically factorize the current frontal matrix (see center of Fig. 6b) after merging the contribution blocks from its children frontal matrices.

(3) Save the factored block in the factor saving space (see left side of Fig. 6b) and the contribution block in the stacked space (see right side of Fig. 6b), and then release the current frontal matrix space.

(4) Repeat steps (1) to (3) until the numerical factorization of the root separator is completed.

Fig. 6
figure 6

Assembly tree and its memory transition using the original method

The memory size of the factored blocks gradually increases as the numerical factorization proceeds, whereas the memory size of the contribution blocks fluctuates irregularly, as shown in Fig. 6c. The fluctuating memory usage often causes the numerical factorization to fail by exceeding the available memory space [28].
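The fluctuation can be made concrete with a small bookkeeping sketch of steps (1) to (4). The dense arithmetic is omitted and the byte counts are hypothetical placeholders; only the growth of the factor space and the up-and-down motion of the contribution stack are modeled.

```cpp
// Minimal sketch of the conventional multifrontal loop: the factor space
// grows monotonically while the contribution-block stack rises and falls,
// as in Fig. 6c.
#include <cstdio>
#include <stack>
#include <vector>

struct Front { size_t factor_bytes, contrib_bytes; int n_children; };

void multifrontal(const std::vector<Front>& postorder) {
    std::stack<size_t> contrib;              // second space: stacked contribution blocks
    size_t factor_total = 0, stack_bytes = 0, peak = 0;
    for (const Front& f : postorder) {       // DFS post-order: children precede parent
        for (int c = 0; c < f.n_children; ++c) {  // step (2): merge children blocks
            stack_bytes -= contrib.top();
            contrib.pop();
        }
        factor_total += f.factor_bytes;      // step (3): first space grows monotonically
        contrib.push(f.contrib_bytes);       // push the new contribution block
        stack_bytes += f.contrib_bytes;      // second space fluctuates
        if (stack_bytes > peak) peak = stack_bytes;
    }
    std::printf("factors: %zu B, peak stack: %zu B\n", factor_total, peak);
}

int main() {
    // Seven fronts of a depth-3 tree in DFS post-order; sizes are made up.
    std::vector<Front> fronts = { {400,200,0}, {400,200,0}, {800,300,2},
                                  {400,200,0}, {400,200,0}, {800,300,2}, {1600,0,2} };
    multifrontal(fronts);
    return 0;
}
```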

4.3 The proposed implementation of the multifrontal method

This research implements the multifrontal method with the BFS-based nested dissection. A global frontal operation is carried out so that it has more parallel opportunities than conventional multifrontal methods. The multifrontal method involves one separator and its immediate children per frontal matrix, whereas the proposed method targets all separators of both groups in a multilevel tree.

The variable factor and update operations are treated as one combined operation in conventional implementations of the multifrontal method. The proposed implementation divides this combination into two independent operations. Since all factor operations for the separators at the same depth are independent, they can be carried out in parallel, depending on the available memory and processors. Similarly, all update operations for ancestor separators are independent, so they can be carried out in parallel as well. This independence gives the proposed implementation more flexible scheduling and parallelism.

Figure 7 illustrates the operational sequence of the proposed method for the multilevel tree in Fig. 1b. In the first stage, the multilevel-tree separators are divided into global factor and contribution groups. If the third-depth separators are members of the global factor group, all upper separators are considered members of the global contribution group. The update operations for the contribution group have to be done after carrying out the variable factor operations for the factor group, as in the multifrontal methods. After the variable factor and update operations of the first stage, the separators of the third depth are no longer involved in the remaining operations. The separators of the second depth therefore become the next factor group, and a similar process is repeated as the second stage. This continues until the root depth is reached, as sketched after Fig. 7.

Fig. 7
figure 7

A proposed numerical factorization process
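The stage loop sketched below makes the sequence concrete, assuming the multilevel tree of Fig. 1b. The routines factor_separator, update_ancestor and synchronize are hypothetical stand-ins for the dense kernels and stream synchronization described in Sects. 5 and 6.

```cpp
// Minimal sketch of the proposed staged factorization of Fig. 7:
// one global factor pass and one global update pass per depth.
#include <cstdio>
#include <vector>

void factor_separator(int s)       { std::printf("factor %d\n", s); }          // stand-in
void update_ancestor(int s, int a) { std::printf("update %d -> %d\n", s, a); } // stand-in
void synchronize()                 { std::printf("sync\n"); }                  // stand-in

int main() {
    // Fig. 1b tree: separators 0-3 at the third depth, 4-5 at the second, 6 is the root.
    std::vector<std::vector<int>> level     = { {6}, {4, 5}, {0, 1, 2, 3} };
    std::vector<std::vector<int>> ancestors = { {4,6}, {4,6}, {5,6}, {5,6}, {6}, {6}, {} };

    for (int d = (int)level.size() - 1; d >= 0; --d) {  // stages: deepest depth first
        // Global factor group: separators of one depth are independent and can
        // be factorized in parallel (e.g. one CUDA stream per separator).
        for (int s : level[d]) factor_separator(s);
        synchronize();                                  // first sync of the stage
        // Global contribution group: each (separator, ancestor) update is
        // independent once the factors of depth d are available.
        for (int s : level[d])
            for (int a : ancestors[s]) update_ancestor(s, a);
        synchronize();                                  // second sync of the stage
    }
    return 0;
}
```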

The frontal matrices of Fig. 8a are the same as those of Fig. 6a; the only difference is the multifrontal factorization sequence. The form of the assembled blocks in Fig. 8b is almost identical to the factor and contribution blocks in Fig. 5b. The operation sequence of the proposed method is exactly the same as the BFS reverse-level order traversal in Fig. 3a. As a result, if a symbolic factorization is performed before the actual numerical factorization, it is possible to predict the necessary block data sizes and locations, including the fill-in blocks. Thus, the required memory allocation can be done only once and used until the completion of the factorization, as shown in Fig. 8c. This feature is very similar to that of the supernodal method. The accurate prediction of the required data size prevents a numerical factorization from being aborted by a lack of host memory, which can be a problem in the conventional implementation of the multifrontal methods, where the memory for the contribution blocks fluctuates irregularly [28].

Fig. 8
figure 8

Multifrontal factorization process using the proposed method

Separators in a global contribution group need their already factorized descendant separators in order to carry out the variable update operations. Since the coupling between the variable factor and update operations has been split, every separator of a global contribution group can independently refer to its factorized descendant separators. As a result, this brings more parallel opportunities compared with the supernodal or multifrontal methods. Figure 9a shows the independence of each variable operation at the first stage.

Fig. 9
figure 9

Factorization process on the first stage and comparison with supernodal method

This independence makes the method look like either a left- or a right-looking method. The proposed method belongs to neither, however, because the separators in each variable operation are independent. Figure 9b compares the proposed method with the supernodal methods.

The proposed method also makes it possible to adjust the required data size adaptively. GPU devices have different memory sizes. Therefore, it is necessary to adjust the operational region to limit the data size for a given GPU device during a numerical factorization. This research proposes to set priorities along the sibling and depth directions to adjust the operational regions. Figure 10 describes the concept of deciding priorities along the two directions.

Fig. 10
figure 10

Two kinds of priorities to adjust operating regions

The priority along the sibling direction controls the size of the global factor group to be handled at a time, depending on the available memory size. Since the global contribution group is automatically determined by the factor group size, the computational algorithm is highly flexible. The priority along the depth direction controls only the size of the global contribution group. Therefore, the depth priority strategy is less effective in controlling the data size than the sibling one. In the actual implementation, the sibling direction is attempted first, and the depth direction follows for fine tuning. The two priority strategies allow the proposed implementation to adapt to diverse GPU memory sizes. The same method could be implemented with the DFS-based nested dissection; however, it would not be as efficient as the BFS-based method, because the DFS-based method must visit many other separators to identify those at the same depth, whereas the proposed strategies require finding sibling separators as quickly as possible. The outstanding prediction and adjustability of the proposed implementation make scatter maps inside a GPU device unnecessary [18].

5 Implementation for a GPU device

This section reviews the features of a GPU and a CPU. Both the original and the proposed multifrontal methods are considered for their implementations on a GPU device. The advantages of the proposed implementation are presented. In addition, optimization strategies are presented to maximize the performance on a GPU device.

5.1 Characteristic of a GPU device

The hardware specifications used for this research are shown in Table 1.

Table 1 Hardware specifications

The memory size of a GPU is smaller than that of the host side in most systems. The experimental computer system has 128 GB of memory, while the GPU device contains only 6 GB. This difference between the host and GPU memory sizes requires frequent data transfers. Moreover, the speed of the PCI-Express link (12 GB/s) between the CPU and the GPU is much slower than the internal memory bandwidth of each computing device (CPU: 43.6 GB/s, GPU: 336 GB/s). Thus, the slow link speed has to be considered when maximizing the performance of the GPU.

The experimental GPU device, an NVIDIA GeForce GTX TITAN BLACK, has the Kepler GK110 architecture. A fully enabled Kepler GK110 consists of 15 SMX units, and each SMX unit includes 64 cores for double precision. Because mechanical dynamics requires highly precise solutions, the double-precision data type is used in this research. There are therefore 960 cores in total for double-precision operations in the experimental GPU device [29]. The theoretical performance of a computing device can be estimated by multiplying the number of cores, the core clock speed, the SIMD width and the FMA factor. The theoretical value is 80 Gflop/s for the experimental CPU and 1931.52 Gflop/s for the GPU.
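Written out, the estimate is

$$ R_{\text{peak}} = N_{\text{cores}} \times f_{\text{clock}} \times N_{\text{SIMD}} \times N_{\text{FMA}}. $$

For the GPU, \(960 \times 1.006\ \text{GHz} \times 1 \times 2 \approx 1931.52\ \text{Gflop/s}\), where the core clock of about 1.006 GHz is the value implied by the quoted figure. As a purely illustrative CPU example, four cores at 2.5 GHz with a four-wide double-precision SIMD unit and FMA would give \(4 \times 2.5\ \text{GHz} \times 4 \times 2 = 80\ \text{Gflop/s}\); the actual parameters of the experimental CPU are listed in Table 1.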

5.2 Application of a GPU device

The linear equation solver for a sparse matrix was divided into the four steps of defining the matrix structure, reordering (analyzing), numerical factorization and solving. The numerical factorization is the most time-consuming of the four steps. In this research a GPU device is used to assist the numerical factorization step, as shown in Fig. 11.

Fig. 11
figure 11

Roles of CPU and GPU devices for linear equation solver

From the viewpoint of memory management, it has been shown that the multifrontal method needs a flexible stack structure at runtime. Since the stack structure requires frequent memory operations for the contribution blocks, the parallel efficiency of the conventional multifrontal method on GPUs is poor. In addition, the fluctuating memory usage may cause a GPU computation to fail. Because of these drawbacks, the conventional multifrontal method is not suitable for programming on GPUs. SPQR in SuiteSparse introduces a way to conduct the assembly and computation of each frontal matrix on a GPU device. When handling a very large frontal matrix, the library splits the tree into sub-trees that fit within the GPU memory. However, this attempt is sometimes not enough to prevent a failure from exceeding the device memory size [19].

From the viewpoint of parallel execution, it is necessary to understand the parallel methodology of a GPU device. Once tasks are added to each stream queue, the GMU (Grid Management Unit), newly introduced in the Kepler GK110 architecture of the GeForce TITAN BLACK, manages and prioritizes the tasks to be executed. The GMU communicates with the CWD (CUDA Work Distributor) via a bidirectional link and also has a directional connection to the SMX units to launch additional tasks via Dynamic Parallelism on the GPU [29].

Figure 12 expresses the workflow of the Kepler architecture and the sequence of queued tasks with respect to the variable operations for all depths in Fig. 8b. The proposed method consists of sets of variable factor and update operations. All independent separators associated with the same depth are added to the stream queues and synchronized until all operations are finalized. Once the variable factor operations are completed and synchronized, all ascendant separators are scheduled to conduct variable update operations by referring to the prior separators already stored on the GPU device. This process is repeated until the root separator is reached. Since there is no data dependency among the variable factors of the same depth, the variable factor operations for all separators of that depth can be parallelized. However, the computing time of the variable factors is generally much less than that of the variable updates. Therefore, parallelizing the variable update operations well is very important for good parallel performance. A sketch of the queuing scheme follows Fig. 12.

Fig. 12
figure 12

An actual workflow of the proposed implementation in Fig. 8b
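A minimal host-side sketch of this queuing scheme is shown below, assuming the blocks of one depth already reside on the device; factor_kernel is a hypothetical placeholder for the per-separator dense factorization, which in practice would be cuBLAS/LAPACK-style routines.

```cpp
// Minimal sketch: one CUDA stream per independent separator of a depth,
// so the GMU/CWD can schedule the factor tasks concurrently on the SMXs.
#include <cuda_runtime.h>
#include <vector>

__global__ void factor_kernel(double* block, int n) {
    // hypothetical placeholder for the dense factorization of one separator
}

void factor_depth(const std::vector<double*>& d_blocks, const std::vector<int>& sizes) {
    std::vector<cudaStream_t> streams(d_blocks.size());
    for (size_t s = 0; s < streams.size(); ++s)
        cudaStreamCreate(&streams[s]);

    for (size_t s = 0; s < d_blocks.size(); ++s)     // queue all factor tasks
        factor_kernel<<<1, 256, 0, streams[s]>>>(d_blocks[s], sizes[s]);

    cudaDeviceSynchronize();                         // first sync of the stage;
                                                     // the update pass follows
    for (size_t s = 0; s < streams.size(); ++s)
        cudaStreamDestroy(streams[s]);
}
```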

The update operation for a separator must receive data from its descendant separators, which were factorized in the preceding variable factor operations and already reside inside the GPU device. Moreover, the variable update operations of all separators are independent of one another.

Since the GPU has only a limited memory space, two cases must be considered depending on the required data size. The first case is when the required data size is equal to or less than the available GPU memory. In this case, all data can be transferred from the CPU to the GPU at once at the beginning and the results can be transferred back at the end, which is called the 'FULL' version in this research. The second case is when the required data size is larger than the GPU memory. The whole data set cannot be transferred at once, so the data must be divided into pieces smaller than the available GPU memory. Only a part of all possible separators for variable factors and their associated ascendant separators for variable updates are handled at a time, which is called the 'PARTIAL' version in this research.

The version type is decided right after the required data size is estimated from the symbolic factorization. The 'PARTIAL' version uses the two priority strategies to decide the partial operational separators. The computing commands are then created just once and saved in the host memory space. If there is enough GPU memory space, the proposed implementation needs two operational synchronizations per depth. Thus, at least twice as many synchronizations of the computing commands as the deepest multilevel-tree depth are required during the numerical factorization process. The number of actual synchronizations may increase depending on the GPU memory size.

5.3 Optimization of a GPU device

The 'FULL' and 'PARTIAL' versions were presented in the previous section. This section presents how to implement the 'PARTIAL' version effectively. The 'PARTIAL' version requires frequent data exchanges to synchronize data. Since the variable factor and update operations depend on each other, no other numerical operation can be conducted while data are being transferred to the GPU device, and the PCI-Express link speed is very slow compared to the internal bandwidth of the devices.

In order to overcome the slow transfer speed when using a GPU, the multi-stream capability of CUDA is applied. The main feature of multi-streams is to divide a large data transfer into smaller data units. If the transfer task of a small data unit and its operation task are added to each stream, a GPU device executes transfer and operation tasks from different streams simultaneously at the hardware level [30]. This makes it possible to compute on huge data sets while minimizing the time lost to the slow PCI-Express link and to assign tasks continuously on GPUs with restricted memory space. One stream corresponds to one separator in this research. Figure 13 explains the principle of multi-streams and its application to the proposed multifrontal method; a minimal sketch follows the figure.

Fig. 13
figure 13

Partial factorization of the proposed multifrontal method
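A minimal sketch of the overlap follows, assuming h_sep points to pinned host buffers (allocated with cudaMallocHost, which cudaMemcpyAsync needs in order to overlap with kernels) and d_sep to preallocated device buffers; update_kernel is a hypothetical placeholder for the per-separator computation.

```cpp
// Minimal sketch of the multi-stream overlap in Fig. 13: the copy of one
// separator proceeds while another separator is being computed.
#include <cuda_runtime.h>

__global__ void update_kernel(double* block, int n) {
    // hypothetical placeholder for GEMM-like variable update work
}

void partial_pass(double** h_sep, double** d_sep,
                  const size_t* bytes, const int* n, int nsep) {
    const int NS = 4;                      // a few streams usually suffice
    cudaStream_t stream[NS];
    for (int i = 0; i < NS; ++i) cudaStreamCreate(&stream[i]);

    for (int k = 0; k < nsep; ++k) {       // one separator per stream, round robin
        cudaStream_t s = stream[k % NS];
        cudaMemcpyAsync(d_sep[k], h_sep[k], bytes[k],
                        cudaMemcpyHostToDevice, s);   // copy overlaps compute
        update_kernel<<<1, 256, 0, s>>>(d_sep[k], n[k]);
        cudaMemcpyAsync(h_sep[k], d_sep[k], bytes[k],
                        cudaMemcpyDeviceToHost, s);   // results stream back
    }
    cudaDeviceSynchronize();               // wait for the whole group
    for (int i = 0; i < NS; ++i) cudaStreamDestroy(stream[i]);
}
```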

Note that the data transfer time should be almost equal to or smaller than the computing time on the GPU. Otherwise, the computing performance may deteriorate due to the waiting time. Therefore, the data size and complexity of the separators are very important for good parallel performance on a GPU device. This is discussed in the next section.

6 Division of separators

A multilevel tree obtained by a nested dissection algorithm generally contains separators of different sizes. The root separator is usually the largest one, and the children separators tend to be smaller than their parents. The difference in separator sizes causes an unbalanced overlap between data transfer and computing time. The approach taken here is similar to the 'tile algorithm' widely used in linear algebra libraries. The tile algorithm shows outstanding computing performance on homogeneous multi-core CPUs [31] and many-core GPUs [32] as well as on heterogeneous systems [33, 34]. This section presents the effects of the division of the separators and how to determine an optimum maximum block size.

6.1 Experimental models

The flop counts of the variable factor and update operations, as well as their ratio, are important in developing an efficient computing algorithm. Table 2a summarizes the BLAS and LAPACK routines used for the numerical factorization step and their approximate flops [35]. Several sparse matrices are selected from the University of Florida Sparse Matrix Collection to show the effects of the separator division. Though some of the original sparse matrices are symmetric and contain only the lower triangle, the upper triangle has been filled from the lower one for this experiment. Table 2b shows the actual flops of the proposed method without division of separators for each model. Note that the variable update operation takes about 80–90% of the total computation.

Table 2 BLAS, LAPACK routines and operation counts of various models

6.2 Effects of division

From the viewpoint of memory usage, the division of separators restrains the area of the arithmetic operations. Large adjacent dense blocks usually take part in a variable update operation with the original separators. The same operation with divided separators shows that the operational regions shrink and smaller blocks are involved in the operations.

Figure 14 shows the effects of the division. \(\mathbf{A}_{1}\) has two adjacent separators, \(\mathbf{A}_{5}\) and \(\mathbf{A}_{7}\), in Fig. 14a. The variable update operation of \(\mathbf{A}_{1}\) involves the adjacent blocks (\(\mathbf{A}_{5}\), \(\mathbf{A}_{7}\), \(\mathbf{A}_{57}\) and \(\mathbf{A}_{75}\)) in the blocked matrix. Meanwhile, each variable update of the smaller separators \(^{1}\mathbf{A}_{1}\) to \(^{4}\mathbf{A}_{1}\) in Fig. 14b updates only a small part of the same region.

Fig. 14
figure 14

Fill-ins before and after the division of separators into a few blocks

From the viewpoint of parallel efficiency, the division of separators hides the transfer time of the required data from host to device more efficiently. Figure 15 shows how much efficiency can be improved by dividing the separators when parallel-processing the numerical factorization. Figure 15a is the case of an original frontal matrix with coarse-grained blocks, and Fig. 15b is the case of a frontal matrix divided into blocks of a specified size with fine-grained blocks. The undivided separators may create a small number of non-uniform variable update operations, whereas the divided separators create a larger number of uniform operations. A GPU computation always involves data copies between host and device, and non-uniform data sizes and operations cause delays in GPU computing and poor parallel performance. Figure 15c, obtained from the NVIDIA Visual Profiler (nvvp), illustrates the two computational timelines corresponding to Figs. 15a and 15b. The ocher bars are data copy tasks from host to device and the blue bars are matrix multiplication operations. The two timelines have the same flops and time scale under a multi-stream environment.

Fig. 15
figure 15

Improvement of parallel efficiency by division of separators

The bottom line is that the overlap for the coarse-grained frontal matrix is poorly matched, so the parallel efficiency drops. The fine-grained timeline, however, shows that most of the computing and copy-task times overlap. Since the variable update operations take up most of the computing time, the division of separators is highly important in GPU computation.

6.3 Optimum maximum block size

The previous section has shown that the division of separators improves the parallel efficiency. A block size must then be chosen to achieve the most efficient parallelism. The factorization time, memory usage and flops have been measured to identify the effects of the maximum block size. Table 4 shows the measured values by numerical factorization type for the models of Table 2b when the block size is varied from 64 to 4096. The same kinds of results from other linear solver routines on CPU and GPU devices are also appended to Table 4; the additional figures are obtained from the GPU versions of CHOLMOD and SPQR included in SuiteSparse and from DSS in MKL [36]. The software specifications and reordering algorithms for each linear solver routine are tabulated in Table 3.

Table 3 Software specifications and reordering algorithms

The number of floating-point operations and the peak memory tend to grow steadily as the maximum block size increases. The computing time, however, decreases down to a certain block size and then increases again. The block size for the minimum computing time has been found to lie in the range of 512–1024. Other matrices have been numerically factorized and show similar behavior. Though the optimum block size of the relatively smaller matrices has been found to be 512, the time difference between 512 and 1024 is small. As a result, the optimum maximum block size is 1024 in this research. However, this block size cannot be applied when a GPU does not have enough memory space to fully store two block rows. In that case, the maximum block size must be reduced so that it does not exceed the GPU memory space, as sketched below.
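A minimal sketch of this fallback is given below. The 'two block rows' criterion follows the text; the halving policy and the row_cols parameter (the number of blocks in the widest block row) are illustrative assumptions.

```cpp
// Minimal sketch: cap the maximum block size so that two double-precision
// block rows fit into the free device memory reported by cudaMemGetInfo.
#include <cuda_runtime.h>
#include <cstddef>

int choose_max_block(int row_cols) {
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    int block = 1024;                         // tuned optimum from Table 4
    while (block > 64) {
        // two block rows of row_cols blocks, each block*block doubles
        size_t need = 2ull * row_cols * block * (size_t)block * sizeof(double);
        if (need <= free_bytes) break;
        block /= 2;                           // fall back to 512, 256, ...
    }
    return block;
}
```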

Although these research results, named MFS (MultiFrontal Solver), do not always show the best performance, most numerical factorization times are similar to or faster than those of CHOLMOD on the same experimental GPU device. Meanwhile, SPQR sometimes produces wrong results of NaN (Not a Number) or fails with a 'GPU memory too small' error message.

7 Numerical experiments

Dynamic analysis of three flexible mechanical models has been carried out for 1 second with 100 output steps using the proposed method with a GPU. Figure 16 shows the shapes of the experimental models and the nodes used to verify the analysis results. Each node marks the position of the maximum Von Mises stress value among all nodes over the whole analysis process for each model.

Fig. 16
figure 16

Three flexible mechanical models for dynamic analysis

The first 'Crank piston' model consists of engine components. The crank shaft rotates about the ground. Each piston and the shaft are connected by connecting rods, which are flexible bodies. Each piston translates with respect to the ground.

The second ‘Suspension’ model is an ordinary double wishbone. Upper and lower control arms rotate with respect to the ground. The upper shocks are fixed to the ground by bushing elements. The upper and lower shocks are connected by a translational joint and spring. The upper and lower control arms are connected by a knuckle which is connected by a revolute joint to the tire. A rack body controls the steering motion of the knuckle. Horizontal motion of the rack and vertical motion of tires are given for the dynamic analysis to observe the knuckle kinematics and compliance characteristics. The lower control arm is modeled as a flexible body.

The last 'Bell crank' model has a fixed center hole, and there are upper and lower holes at the end of each arm. Dynamic forces are applied at the two hole faces in opposite directions. A boundary condition is used to fasten the body. Standard elastic steel properties are used for all bodies of the models except the tires of the second model.

A variable step size has been used to numerically integrate Eq. (6), and the \(\mathbf{H_{x}^{i}}\) matrix in Eq. (7) must be generated and solved at every time step. There are numerical errors due to finite-precision arithmetic and the condition number of the linear system [25], and a dynamic analysis accumulates these errors over time. Accurate solutions have a decisive effect on the iterations of Newton's method as well as on the overall dynamic analysis time. Thus, the iterative refinement option is used to satisfy the accuracy requirement during the solving phase. In order to verify the operational performance of the proposed implementation, DSS is used for comparison. Except for the operating system, the same software specifications and reordering algorithm as in Table 3b are used, because the dynamic analysis software supports only the Windows operating system; the CHOLMOD and SPQR routines therefore cannot be included in this experiment. The number of linear solving steps is the same for the two routines.

In Table 5, the peak memory usage and the detailed computing times with their proportions are tabulated. The times are categorized into two parts: the linear equation solver time, which includes the reordering, numerical factorization and solving phases, and the remaining time for reading the input file, converting the required data structures and generating the system matrix at every time step.

The GPU device is used only in the numerical factorization step, so it has an outstanding effect on the linear solver time while having no effect on the remaining time. Although there are considerable speed-ups in the numerical factorization step, the total time improvement ratio cannot surpass the factorization speed-up rate, due to Amdahl's law [37]. As a result, the aggregate time with the GPU is about 1.9 to 4.7 times faster, a little less than the 2.5 to 5.9 times improvement of the factorization itself, compared to the CPU.

The flops of the variable update operations occupy most of the total numerical factorization. Since the variable update consists only of matrix multiplications, it is important to analyze the matrix multiplication to achieve the best computing performance. Many non-square matrices of different sizes are involved in the variable update. Because it is difficult to estimate the computing time for all kinds of matrices, all matrices are converted into computationally equivalent square matrices. Figure 17 depicts two kinds of data for one of the sparse matrices generated during the dynamic analysis. One is the constant theoretical peak together with the experimental Gflop/s of square matrix multiplication in double precision as the matrix size increases. The other is the percentages of the equivalent square matrix sizes whose arithmetic operation counts come from the models of Fig. 16.

Fig. 17
figure 17

A distribution of equivalent square matrix sizes and its Gflop/s

Relative to the square matrix size 128 in Fig. 17, the experimental models show different tendencies. Below 128, the first and second models have higher proportions than the third model; above 128, and especially between 512 and 1024, the third model shows higher percentages. The experimental performance of matrix multiplication grows consistently up to size 2048. Therefore, the outstanding computing performance of the GPU device is strongly affected by the proportion of larger equivalent square matrix sizes.

Meanwhile, the peak memory usage of the proposed method on a GPU device is greater than that of MKL DSS on a CPU. The memory usage is related to the block size of the separators, as shown in Table 4. There is an inverse relationship between block size and factorization time below a certain block size, here 1024. It is possible to reduce the memory usage by reducing the maximum block size, but doing so certainly increases the numerical factorization time. This trade-off seems to be inherent to the experimental GPU architecture. Table 6 summarizes the computing time and memory usage of one sparse matrix from the dynamic analysis as the maximum block size varies from 64 to 1024.

Table 4 Changes of time, memory usage and flops of Table 2b models
Table 5 Dynamic analysis time and memory usage results for each model
Table 6 A correlation between memory usage and computing time for block size

It is also important to compare the solution accuracies of the GPU and the CPU. The Von Mises stress and acceleration magnitude values at the nodes shown in Fig. 16 are used to check the solution accuracy. The scaled infinity-norm values of Eq. (9) are compared to evaluate the accuracy. Table 7 shows that the norm values are almost identical, which verifies the solution accuracy.

$$ \text{res}_{\text{scaled}} = \frac{\Vert \mathbf{x}_{\text{cpu}} - \mathbf{x}_{\text{gpu}} \Vert _{\infty}}{ \Vert \mathbf{x}_{\text{cpu}} \Vert _{\infty} + \Vert \mathbf{x}_{\text{gpu}} \Vert _{\infty}} $$
(9)
Table 7 The scaled residual values for each solution type
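Equation (9) translates directly into code; the following sketch computes the scaled residual from the two solution vectors.

```cpp
// Scaled infinity-norm residual of Eq. (9) comparing CPU and GPU solutions.
#include <cmath>
#include <vector>

double scaled_residual(const std::vector<double>& x_cpu,
                       const std::vector<double>& x_gpu) {
    double diff = 0.0, n_cpu = 0.0, n_gpu = 0.0;
    for (size_t i = 0; i < x_cpu.size(); ++i) {
        diff  = std::fmax(diff,  std::fabs(x_cpu[i] - x_gpu[i]));  // ||x_cpu - x_gpu||_inf
        n_cpu = std::fmax(n_cpu, std::fabs(x_cpu[i]));             // ||x_cpu||_inf
        n_gpu = std::fmax(n_gpu, std::fabs(x_gpu[i]));             // ||x_gpu||_inf
    }
    return diff / (n_cpu + n_gpu);
}
```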

8 Conclusion

A new linear equation solver for a GPU has been implemented using a BFS-based reordering and the multifrontal method. The proposed implementation is parallelized for an experimental GPU. Since popular direct methods have several drawbacks when applied to GPUs, a new implementation was needed. In order to overcome the drawbacks, a combination of the BFS-based reordering method and the multifrontal method is proposed. A global multifrontal operation is carried out from the deepest separators to the root of a multilevel tree; this sequence is exactly the BFS reverse-level order traversal. The BFS-based implementation is more suitable for the GPU device than the DFS-based one, and it makes it easy for the host system to set the priorities of the operational regions to fit the GPU memory size. The grouping of the variable factor and update operations gives the separators more parallel opportunities. However, another difficulty for efficient parallel processing comes from the non-uniform sizes of the separators. To resolve this difficulty, large separators are divided into smaller blocks, and an experimental approach estimates the optimum maximum block size to be 512 or 1024. Dynamic analysis of three mechanical models has been carried out to demonstrate the effectiveness of the proposed method. The computing time and memory usage on the GPU are compared with those obtained from the DSS routine included in MKL on the CPU: the proposed method performs 1.9–4.7 times faster over the whole computing process. The important factor deciding the performance improvement on a GPU device is the proportion of large block matrices involved in the variable update operations. The proposed implementation and DSS have yielded the same level of solution accuracy. The proposed method will be extended to multi-GPU systems.