1 Introduction

In a number of science and engineering applications, researchers are faced with the task of solving large numbers of independent systems of ordinary differential equations (ODEs). One prominent example is the simulation of reactive flows for modeling combustion [5, 15, 25, 27, 28], atmospheric chemistry [1, 13], and groundwater transport [2, 3], where operator splitting [31, 33] decouples the solution of the fluid transport (e.g., advection, diffusion) and chemical kinetics terms. This results in large numbers of independent ODEs for the conservation of chemical species masses, with one system for each spatial location. The solution of the aggregate of these ODEs consumes most of the total wall-clock time of such simulations, >90 % in some cases. Simulations of electrical behavior in cardiac tissue use a similar operator splitting technique, which results in ODE systems for cell membrane dynamics [23, 34]. Other areas that deal with solving many independent systems of ODEs include systems biology [6, 40] and Monte Carlo methods for sensitivity analysis of ODEs [10, 16, 18].

In such problems, large numbers of the same governing ODEs with different initial conditions and/or input parameters must be integrated; since each system is independent, the entire set of ODEs can be integrated concurrently. While performance can be improved by using parallel central processing unit (CPU) methods, this embarrassingly parallel problem is especially well-suited to acceleration with the thread-based parallelism of graphics processing units (GPUs), as demonstrated for reactive-flow simulations [21, 22, 29, 32]. In particular, Niemeyer and Sung [21] recently developed a GPU-based approach for the integration of moderately stiff chemical kinetics ODEs with explicit algorithms, using an adaptive fifth-order Runge–Kutta–Cash–Karp (RKCK) method for nonstiff cases and a stabilized second-order Runge–Kutta–Chebyshev (RKC) method for problems with greater stiffness. For large numbers of ODEs, they demonstrated that the GPU-based RKCK and RKC algorithms performed up to 126 and 65 times faster, respectively, than CPU versions of the same algorithms on a single CPU core. Furthermore, with moderate levels of stiffness, the GPU-based RKC offered a speedup factor of 57 compared to a conventional implicit algorithm executed in parallel on a six-core CPU. The specific acceleration factor demonstrated depended on the problem studied (e.g., the kinetic mechanism) and the number of ODEs considered. Due to the favorable performance of these methods, in this chapter we present the integration algorithms, associated GPU source code, and implementation details so that interested readers can apply them to more general applications (e.g., the areas described above).

2 Mathematical Background

In this chapter, we represent a generic system of ODEs using

$$\displaystyle{ \frac{d\mathbf{y}} {dt} =\mathbf{ f}\left (t,\mathbf{y}(t),\mathbf{g}\right ), }$$
(8.1)

where \(\mathbf{y}(t)\) is the vector of unknown dependent variables at some time t, \(\mathbf{f}\) is the right-hand-side vector function, and \(\mathbf{g}\) is a vector of constant parameters (e.g., pressure or temperature for chemical kinetics). The size of \(\mathbf{y}\) (i.e., the number of equations/unknowns) is N. For the types of problems with which we are concerned here, a large number of ODE systems, \(N_{\text{ode}}\), each given by Eq. (8.1), must be integrated independently from some time \(t = t_{n}\) to \(t_{n+1}\), with different initial conditions \(\mathbf{y}\left (t_{n}\right )\) and constant parameters \(\mathbf{g}\) for each system. The numerical approximation to the exact solution \(\mathbf{y}(t_{n})\) is \(\mathbf{y}_{n}\), and the step size for a given step is \(\delta t_{n} = t_{n+1} - t_{n}\).

Nonstiff ODEs, or those with little stiffness, can be solved using explicit integration methods. Many such methods exist; in general, algorithms can be classified into Runge–Kutta and linear multistep methods, and also into explicit and implicit methods; see Hairer and Wanner [9] for more details. Stiffness, a concept somewhat difficult to quantify precisely, refers to the property of an ODE system that makes standard explicit methods perform poorly: unreasonably small time-step sizes are required to keep the solutions stable—otherwise they oscillate wildly [8].

Traditionally, implicit integration algorithms, such as those based on backward differentiation formulas, have been used to handle stiff equations, but these require expensive linear algebra operations on the Jacobian matrix (e.g., LU decomposition, matrix inversion). The complex logical flow of such operations results in highly divergent instructions across different initial conditions, making implicit algorithms unsuitable for operation on GPUs. In fact, Stone and Davis [32] implemented a traditional high-order implicit algorithm on GPUs and found that it performed only slightly better than a multi-core CPU version of the same algorithm. While implicit algorithms may be required for ODE systems with extreme levels of stiffness (suggesting that new solutions need to be found for GPU acceleration of such problems), other options can be used for cases of little-to-moderate stiffness. Here, we describe two integration algorithms suitable for solving many independent systems of ODEs on GPUs.

2.1 Runge–Kutta–Cash–Karp

For nonstiff ODEs, an appropriate explicit algorithm is the fifth-order Runge–Kutta method developed by Cash and Karp [4]: the RKCK method. The RKCK method estimates the local truncation error using an embedded fourth-order method, by taking the difference between the fourth- and fifth-order solutions. It then uses this estimate to adaptively select the step size [26].

Using the terminology established above, the RKCK formulas—which also apply to any general fifth-order Runge–Kutta method—are

$$\displaystyle\begin{array}{rcl} \mathbf{k}_{1} =\delta t_{n}\,\mathbf{f}\left (t_{n},\mathbf{y}_{n},\mathbf{g}\right )\;,& &{}\end{array}$$
(8.2)
$$\displaystyle\begin{array}{rcl} \mathbf{k}_{2} =\delta t_{n}\,\mathbf{f}\left (t_{n} + a_{2}\,\delta t_{n},\mathbf{y}_{n} + b_{21}\mathbf{k}_{1},\mathbf{g}\right )\;,& &{}\end{array}$$
(8.3)
$$\displaystyle\begin{array}{rcl} \mathbf{k}_{3} =\delta t_{n}\,\mathbf{f}\left (t_{n} + a_{3}\,\delta t_{n},\mathbf{y}_{n} + b_{31}\mathbf{k}_{1} + b_{32}\mathbf{k}_{2},\mathbf{g}\right )\;,& &{}\end{array}$$
(8.4)
$$\displaystyle\begin{array}{rcl} \mathbf{k}_{4} =\delta t_{n}\,\mathbf{f}\left (t_{n} + a_{4}\,\delta t_{n},\mathbf{y}_{n} + b_{41}\mathbf{k}_{1} + b_{42}\mathbf{k}_{2} + b_{43}\mathbf{k}_{3},\mathbf{g}\right )\;,& &{}\end{array}$$
(8.5)
$$\displaystyle\begin{array}{rcl} \mathbf{k}_{5} =\delta t_{n}\,\mathbf{f}\left (t_{n} + a_{5}\,\delta t_{n},\mathbf{y}_{n} + b_{51}\mathbf{k}_{1} + b_{52}\mathbf{k}_{2} + b_{53}\mathbf{k}_{3} + b_{54}\mathbf{k}_{4},\mathbf{g}\right )\;,& &{}\end{array}$$
(8.6)
$$\displaystyle\begin{array}{rcl} \mathbf{k}_{6} =\delta t_{n}\,\mathbf{f}\left (t_{n} + a_{6}\,\delta t_{n},\mathbf{y}_{n} + b_{61}\mathbf{k}_{1} + b_{62}\mathbf{k}_{2} + b_{63}\mathbf{k}_{3} + b_{64}\mathbf{k}_{4} + b_{65}\mathbf{k}_{5},\mathbf{g}\right )\;,& &{}\end{array}$$
(8.7)
$$\displaystyle\begin{array}{rcl} \mathbf{y}_{n+1} =\mathbf{ y}_{n} + c_{1}\mathbf{k}_{1} + c_{2}\mathbf{k}_{2} + c_{3}\mathbf{k}_{3} + c_{4}\mathbf{k}_{4} + c_{5}\mathbf{k}_{5} + c_{6}\mathbf{k}_{6}\;,& &{}\end{array}$$
(8.8)
$$\displaystyle\begin{array}{rcl} \mathbf{y}_{n+1}^{{\ast}} =\mathbf{ y}_{ n} + c_{1}^{{\ast}}\mathbf{k}_{ 1} + c_{2}^{{\ast}}\mathbf{k}_{ 2} + c_{3}^{{\ast}}\mathbf{k}_{ 3} + c_{4}^{{\ast}}\mathbf{k}_{ 4} + c_{5}^{{\ast}}\mathbf{k}_{ 5} + c_{6}^{{\ast}}\mathbf{k}_{ 6}\;,& &{}\end{array}$$
(8.9)

where \(\mathbf{y}_{n+1}\) is the fifth-order solution and \(\mathbf{y}_{n+1}^{{\ast}}\) is the solution of the embedded fourth-order method. The RKCK coefficients are given in Table 8.1. The local error \(\boldsymbol{\Delta }_{n+1}\) is estimated using the difference between the fourth- and fifth-order solutions:

$$\displaystyle{ \boldsymbol{\Delta }_{n+1} =\mathbf{ y}_{n+1} -\mathbf{ y}_{n+1}^{{\ast}} =\sum _{ i=1}^{6}\left (c_{ i} - c_{i}^{{\ast}}\right )\mathbf{k}_{ i}\;. }$$
(8.10)
Table 8.1 Coefficients for the fifth-order Runge–Kutta–Cash–Karp method, adopted from Press et al. [26]

Then, this error is compared against a desired accuracy, \(\boldsymbol{\Delta }_{0}\), defined by

$$\displaystyle{ \boldsymbol{\Delta }_{0} =\varepsilon \left (\vert \mathbf{y}_{n}\vert + \left \vert \delta t_{n}\,\mathbf{f}\left (t_{n},\mathbf{y}_{n},\mathbf{g}\right )\right \vert +\delta \right )\;, }$$
(8.11)

where \(\varepsilon\) is a tolerance level and δ represents a small value (e.g., \(10^{-30}\)). When the estimated error rises above the desired accuracy (\(\boldsymbol{\Delta }_{n+1} >\boldsymbol{ \Delta }_{0}\)), the algorithm rejects the current step and calculates a smaller step size. Correspondingly, the algorithm accepts a step with error at or below the desired accuracy (\(\boldsymbol{\Delta }_{n+1} \leq \boldsymbol{ \Delta }_{0}\)) and calculates a new step size for the next step using

$$\displaystyle{ \delta t_{\text{new}} = \left \{\begin{array}{@{}l@{\quad }l@{}} S\,\delta t_{n}\,\max _{i}\left (\left \vert \frac{\Delta _{0,i}} {\Delta _{n+1,i}}\right \vert \right )^{1/5}\;\text{if }\boldsymbol{\Delta _{n+1}} \leq \boldsymbol{ \Delta }_{0}\;,\text{ or}\quad \\ S\,\delta t_{n}\,\max _{i}\left (\left \vert \frac{\Delta _{0,i}} {\Delta _{n+1,i}}\right \vert \right )^{1/4}\;\text{if }\boldsymbol{\Delta _{n+1}} >\boldsymbol{ \Delta }_{0}\;. \quad \end{array} \right. }$$
(8.12)

Here, i represents the ith element of the related vector and S denotes a safety factor slightly smaller than unity (e.g., 0.9). Equation (8.12) is used to calculate the next time step size both for an accepted step and also for a new, smaller step size in the case of a step rejection. In practice, step size decreases and increases are limited to factors of 10 and 5, respectively [26].

2.2 Runge–Kutta–Chebyshev

For ODEs with moderate levels of stiffness, one alternative to implicit algorithms is a stabilized explicit scheme such as the RKC method [30, 35–39]. While the RKC method is explicit, it handles stiffness through additional stages—beyond the first two required for second-order accuracy—that extend its stability domain along the negative real axis of eigenvalues.

Our RKC implementation is taken from Sommeijer et al. [30] and Verwer et al. [39]. Following the same terminology as above, the formulas for the second-order RKC are

$$\displaystyle\begin{array}{rcl} \mathbf{w}_{0} =\mathbf{ y}_{n}\;,& &{}\end{array}$$
(8.13)
$$\displaystyle\begin{array}{rcl} \mathbf{w}_{1} =\mathbf{ w}_{0} +\tilde{\mu } _{1}\,\delta t_{n}\,\mathbf{f}_{0}\;,& &{}\end{array}$$
(8.14)
$$\displaystyle\begin{array}{rcl} \mathbf{w}_{j}& =& (1 -\mu _{j} -\nu _{j})\mathbf{w}_{0} +\mu _{j}\mathbf{w}_{j-1} \\ & & +\:\nu _{j}\mathbf{w}_{j-2} +\tilde{\mu } _{j}\,\delta t_{n}\,\mathbf{f}_{j-1} +\tilde{\gamma } _{j}\,\delta t_{n}\,\mathbf{f}_{0},\quad j = 2,\mathop{\ldots },s\;,{}\end{array}$$
(8.15)
$$\displaystyle\begin{array}{rcl} \mathbf{y}_{n+1} =\mathbf{ w}_{s}\;,& &{}\end{array}$$
(8.16)

where s is the total number of stages and \(\mathbf{w}_{j}\) are internal vectors for the stages. The right-hand-side derivative vector function, \(\mathbf{f}_{j}\), is evaluated at each stage, such that \(\mathbf{f}_{j} =\mathbf{ f}(t_{n} + c_{j}\,\delta t_{n},\mathbf{w}_{j},\mathbf{g})\). The recursive nature of \(\mathbf{w}_{j}\) allows the use of only five arrays for storage. The coefficients used in Eqs. (8.14) and (8.15) can be obtained analytically for any number of stages s ≥ 2:

$$\displaystyle\begin{array}{rcl} \tilde{\mu }_{1} = b_{1}\omega _{1},\quad \mu _{j} = \frac{2b_{j}\omega _{0}} {b_{j-1}},\quad \nu _{j} = \frac{-b_{j}} {b_{j-2}},\quad \tilde{\mu }_{j} = \frac{2b_{j}\omega _{1}} {b_{j-1}}\;,& &{}\end{array}$$
(8.17)
$$\displaystyle\begin{array}{rcl} \tilde{\gamma }_{j} = -a_{j-1}\tilde{\mu }_{j},\quad b_{0} = b_{2},\quad b_{1} = \frac{1} {\omega _{0}},\quad b_{j} = \frac{T_{j}^{{\prime\prime}}(\omega _{0})} {\left (T_{j}^{{\prime}}(\omega _{0})\right )^{2}}\;,& &{}\end{array}$$
(8.18)
$$\displaystyle\begin{array}{rcl} \omega _{0} = 1 + \frac{\kappa } {s^{2}},\quad \omega _{1} = \frac{T_{s}^{{\prime}}(\omega _{0})} {T_{s}^{{\prime\prime}}(\omega _{0})},\quad a_{j} = 1 - b_{j}T_{j}\left (\omega _{0}\right )\;,& &{}\end{array}$$
(8.19)

for \(j = 2,\mathop{\ldots },s\), where κ ≥ 0 is the damping parameter (e.g., \(\kappa = 2/13\) [30, 39]). The Chebyshev polynomials of the first kind, \(T_{j}(x)\), with first and second derivatives \(T_{j}^{{\prime}}(x)\) and \(T_{j}^{{\prime\prime}}(x)\), respectively, are defined recursively as

$$\displaystyle{ T_{j}(x) = 2xT_{j-1}(x) - T_{j-2}(x),\quad j = 2,\mathop{\ldots },s\;, }$$
(8.20)

where \(T_{0}(x) = 1\) and \(T_{1}(x) = x\). The coefficients \(c_{j}\) used in the function evaluations are

$$\displaystyle\begin{array}{rcl} c_{1} = \frac{c_{2}} {T_{2}^{{\prime}}(\omega _{0})} \approx \frac{c_{2}} {4} \;,& &{}\end{array}$$
(8.21)
$$\displaystyle\begin{array}{rcl} c_{j} = \frac{T_{s}^{{\prime}}(\omega _{0})} {T_{s}^{{\prime\prime}}(\omega _{0})} \frac{T_{j}^{{\prime\prime}}(\omega _{0})} {T_{j}^{{\prime}}(\omega _{0})} \approx \frac{j^{2} - 1} {s^{2} - 1},\quad 2 \leq j \leq s - 1\;,& &{}\end{array}$$
(8.22)
$$\displaystyle\begin{array}{rcl} c_{s} = 1\;.& &{}\end{array}$$
(8.23)
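Since the coefficients in Eqs. (8.17)–(8.19) and (8.21)–(8.23) only require the Chebyshev polynomials and their first two derivatives at the single point \(\omega_{0}\), all of these values can be generated in one pass of the three-term recurrences obtained by differentiating Eq. (8.20). A minimal sketch (an illustrative helper of ours, assuming s ≥ 2):

// evaluate T_s(x), T_s'(x), and T_s''(x) via the recurrences
//   T_j   = 2 x T_{j-1} - T_{j-2}
//   T_j'  = 2 T_{j-1} + 2 x T_{j-1}' - T_{j-2}'
//   T_j'' = 4 T_{j-1}' + 2 x T_{j-1}'' - T_{j-2}''
void chebyshev (const double x, const int s,
                double* Ts, double* Tps, double* Tpps) {
    double Tm2 = 1.0, Tm1 = x;          // T_0(x), T_1(x)
    double Tpm2 = 0.0, Tpm1 = 1.0;      // T_0'(x), T_1'(x)
    double Tppm2 = 0.0, Tppm1 = 0.0;    // T_0''(x), T_1''(x)

    for (int j = 2; j <= s; ++j) {
        double T = 2.0 * x * Tm1 - Tm2;
        double Tp = 2.0 * Tm1 + 2.0 * x * Tpm1 - Tpm2;
        double Tpp = 4.0 * Tpm1 + 2.0 * x * Tppm1 - Tppm2;
        Tm2 = Tm1;     Tm1 = T;
        Tpm2 = Tpm1;   Tpm1 = Tp;
        Tppm2 = Tppm1; Tppm1 = Tpp;
    }

    *Ts = Tm1; *Tps = Tpm1; *Tpps = Tppm1;
}

The intermediate degree-j values can be captured in the same loop when the per-stage coefficients \(b_{j}\), \(a_{j}\), and \(c_{j}\) are needed.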

As with the RKCK method in Sect. 8.2.1, the RKC method can also be used with an adaptive time stepping method for error control, as given by Sommeijer et al. [30]. The error accrued in taking the step \(t_{n+1} = t_{n} +\delta t_{n}\) and obtaining \(\mathbf{y}_{n+1}\) is estimated using

$$\displaystyle{ \boldsymbol{\Delta }_{n+1} = \frac{4} {5}(\mathbf{y}_{n} -\mathbf{ y}_{n+1}) + \frac{2} {5}\delta t_{n}(\mathbf{f}_{n} +\mathbf{ f}_{n+1})\;. }$$
(8.24)

We then obtain the weighted RMS norm of error using this error estimate with absolute and relative tolerances:

$$\displaystyle\begin{array}{rcl} \|\boldsymbol{\Delta }_{n+1}\|_{\text{RMS}} = \left \|\frac{\boldsymbol{\Delta }_{n+1}} {\mathbf{T}\sqrt{N}}\right \|_{2}\;,& &{}\end{array}$$
(8.25)
$$\displaystyle\begin{array}{rcl} \mathbf{T} =\mathbf{ A} + R \cdot \max \left (\vert \mathbf{y}_{n}\vert,\vert \mathbf{y}_{n+1}\vert \right )\;,& &{}\end{array}$$
(8.26)

where N represents the number of unknown variables as defined previously, \(\mathbf{A}\) is the vector of absolute tolerances, and R is the relative tolerance. The norm \(\|\cdot \|_{2}\) indicates the Euclidean or \(L_{2}\) norm. If \(\|\boldsymbol{\Delta }_{n+1}\|_{\text{RMS}} \leq 1\), the step is accepted; otherwise, it is rejected and repeated using a smaller step size. Finally, a new step size is calculated using the weighted RMS norm of error for the current and prior steps, as well as the associated step sizes, via

$$\displaystyle\begin{array}{rcl} \delta t_{n+1} =\min \left (10,\max (0.1,f)\right )\delta t_{n}\;,& &{}\end{array}$$
(8.27)
$$\displaystyle\begin{array}{rcl} f = 0.8\left ( \frac{\|\boldsymbol{\Delta }_{n}\|_{\text{RMS}}^{1/(p+1)}} {\|\boldsymbol{\Delta }_{n+1}\|_{\text{RMS}}^{1/(p+1)}} \frac{\delta t_{n}} {\delta t_{n-1}}\right ) \frac{1} {\|\boldsymbol{\Delta }_{n+1}\|_{\text{RMS}}^{1/(p+1)}}\;,& &{}\end{array}$$
(8.28)

where p = 2, the order of the algorithm. We use Eq. (8.27) with a modified relation to calculate a new step size for a step rejection:

$$\displaystyle{ f = \frac{0.8} {\|\boldsymbol{\Delta }_{n+1}\|_{\text{RMS}}^{1/(p+1)}}\;. }$$
(8.29)

Determining the initial time step size requires special consideration. First, the algorithm takes a tentative integration step, using the inverse of the spectral radius σ—the magnitude of the largest eigenvalue—of the Jacobian matrix as the step size. Then, after estimating the error associated with this tentative step, it calculates a new step size following a similar procedure to that given in Eqs. (8.27) and (8.28):

$$\displaystyle\begin{array}{rcl} \delta t_{0} = \frac{1} {\sigma } \;,& &{}\end{array}$$
(8.30)
$$\displaystyle\begin{array}{rcl} \boldsymbol{\Delta }_{0} =\delta t_{0}\left (\mathbf{f}(t_{0} +\delta t_{0},\mathbf{y}_{0} +\delta t_{0}\,\mathbf{f}(t_{0},\mathbf{y}_{0})) -\mathbf{ f}(t_{0},\mathbf{y}_{0})\right )\;,& &{}\end{array}$$
(8.31)
$$\displaystyle\begin{array}{rcl} \delta t_{1} = 0.1 \frac{\delta t_{0}} {\|\boldsymbol{\Delta _{0}}\|_{\text{RMS}}^{1/2}}\;,& &{}\end{array}$$
(8.32)

where \(\|\boldsymbol{\Delta }_{0}\|_{\text{RMS}}\) is evaluated in the same manner as \(\|\boldsymbol{\Delta }_{n+1}\|_{\text{RMS}}\) using Eq. (8.25).

After selecting the optimal time step size to control local error, the algorithm next determines the optimal number of RKC stages in order to remain stable. Due to stiffness, too few stages could lead to instability; in contrast, using more stages than required would add unnecessary computational effort. The number of stages is determined using the spectral radius and time step size:

$$\displaystyle{ s = 1 + \sqrt{1 + 1.54\,\delta t_{n}\,\sigma }\;, }$$
(8.33)

as suggested by Sommeijer et al. [30], where the value 1.54 is related to the stability boundary of the algorithm. Note that s may vary between time steps due to a changing spectral radius and time step size. We recommend using a nonlinear power method [30] to calculate the spectral radius with our RKC implementation; this choice costs an additional vector to store the computed eigenvector, but avoids storing or calculating the Jacobian matrix directly. Following Sommeijer et al. [30], our RKC implementation estimates the spectral radius every 25 (internal) steps or after a step rejection.

3 Source Code

Next, we provide implementation details and source code for the GPU versions of the algorithms described above. The number of unknowns (and corresponding equations) N is represented with the variable NEQN, and the number of ODE systems \(N_{\text{ode}}\) is defined as numODE in the following code. In order for the GPU algorithms to offer a performance increase over CPU algorithms, \(N_{\text{ode}}\) should be relatively large. Although the exact number of ODEs where the GPU algorithm becomes faster than its CPU equivalent is problem dependent, Niemeyer and Sung [21] showed that a GPU implementation of the RKCK algorithm for chemical kinetics outperforms an equivalent serial CPU version for as few as 128 ODE systems. All operations are given here in double precision although, depending on the needs of the specific application, single-precision calculations may be preferable to reduce the computational expense.
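Note that NEQN must be a compile-time constant, since the per-thread working arrays in the device code below are statically sized; a representative setup (the values here are hypothetical) is:

#define NEQN 28       // number of equations per ODE system
int numODE = 65536;   // number of independent ODE systems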

In order to take advantage of global memory coalescing on the GPU, we recommend storing the set of dependent variable vectors \(\mathbf{y}_{i}\), where \(i = 1,\mathop{\ldots },N_{\text{ode}}\), in a single one-dimensional array, where variables corresponding to the same unknown for consecutive systems sit adjacent in memory. In other words, if \(\mathbf{Y }\) is a two-dimensional matrix with N ode rows and N columns, where the ith row contains the unknown vector \(\mathbf{y}_{i}\), then \(\mathbf{Y }\) should be stored in memory as a one-dimensional array in column-major ordering. This ensures that adjacent GPU threads in the same warp access adjacent global memory locations when reading or writing equivalent array elements. See Kirk and Hwu [14] or Jang et al. [12] for more details and examples on global memory coalescing.

The following code snippet shows the proper loading of a host array yHost from an arbitrary array y that contains initial conditions for all ODEs:
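A minimal sketch, assuming a hypothetical two-dimensional source array in which y[i][j] holds unknown j of system i:

// allocate the host array and fill it in column-major order, so that the
// same unknown of consecutive systems is adjacent in memory
double *yHost = (double *) malloc (numODE * NEQN * sizeof(double));

for (int i = 0; i < numODE; ++i) {
    for (int j = 0; j < NEQN; ++j) {
        yHost[i + numODE * j] = y[i][j];
    }
}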

A similar procedure should be used for the constant parameter vector \(\mathbf{g}\) if needed.

Next, the GPU global memory arrays should be declared and initialized, and the block/grid dimensions set up using
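A sketch consistent with the kernel launches below (gHost is a hypothetical host-side array of the per-system constant parameters; the grid is padded so that all systems are covered, with the tid < numODE guard in the kernel absorbing any excess threads):

double *yDevice;
cudaMalloc ((void**) &yDevice, numODE * NEQN * sizeof(double));

double *gDevice;
cudaMalloc ((void**) &gDevice, numODE * sizeof(double));
cudaMemcpy (gDevice, gHost, numODE * sizeof(double), cudaMemcpyHostToDevice);

// simple one-dimensional block and grid dimensions
int blockSize = 64;   // for problems with fewer than 4,194,304 ODEs [21]
dim3 dimBlock (blockSize, 1);
dim3 dimGrid ((numODE + blockSize - 1) / blockSize, 1);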

Here, we use simple one-dimensional block and grid dimensions; reshaping the grid should not affect performance, but can be done for convenience. We chose 64 as the block size for problems with fewer than 4,194,304 ODEs, based on experience with chemical kinetics problems [21]. The size should remain a power of two, but different block sizes may be optimal for other problems.

The final step is to set up the ODE integration loop and kernel function execution. The integration driver kernel, to be described shortly, will perform internal sub-stepping as necessary to reach the specified end time. Depending on the objectives, there are various ways to approach this:

  • If only the final integrated results are needed, then a single GPU kernel can be invoked.

  • If intermediate integration results are needed, then an acceptable outer step size over which results will be spaced should be chosen, and the GPU kernel should be invoked inside a loop.

We will leave the code as general as possible by following the second approach, although modifications should be made depending on the desired functionality. In both cases, the global memory array holding the variables to be integrated needs to be transferred to the GPU before, and from the GPU after, each kernel invocation.

// set initial time
double t = t0;
double tNext = t + h;

while (t < tEnd) {
    // transfer memory to GPU
    cudaMemcpy (yDevice, yHost, numODE * NEQN * sizeof(double), cudaMemcpyHostToDevice);

    intDriver <<<dimGrid, dimBlock>>> (t, tNext, numODE, gDevice, yDevice);

    // transfer memory back to CPU
    cudaMemcpy (yHost, yDevice, numODE * NEQN * sizeof(double), cudaMemcpyDeviceToHost);

    t = tNext;
    tNext += h;
}

cudaFree (gDevice);
cudaFree (yDevice);

Here, t0 refers to the initial time, tEnd the desired final time, and h the outer step size. In the current form, each outer integration step performed by the GPU will be a “restart” integration, meaning no information about previous steps (e.g., error estimates, step sizes) will be used to assist the startup. This is necessary in certain applications such as reactive-flow simulations (and other simulation methods that use operator splitting), where, after each outer step, integration results are combined with changes due to fluid transport, thereby invalidating stored integration information. However, where possible, better performance may be obtained by transferring the appropriate data from the GPU and using it in the next overall integration step.

The next code snippet contains the general integration driver kernel, suitable for either algorithm:

__global__ void
intDriver (const double t, const double tEnd, const int numODE,
           const double* gGlobal, double* yGlobal) {

    // unique thread ID, based on local ID in block and block ID
    int tid = threadIdx.x + (blockDim.x * blockIdx.x);

    // ensure thread is within limit
    if (tid < numODE) {

        // local array with initial values
        double yLocal[NEQN];

        // constant parameter(s)
        double gLocal = gGlobal[tid];

        // load local array with initial values from global array
        for (int i = 0; i < NEQN; ++i) {
            yLocal[i] = yGlobal[tid + numODE * i];
        }

        // call integrator for one time step
        integrator (t, tEnd, gLocal, yLocal);

        // update global array with integrated values
        for (int i = 0; i < NEQN; ++i) {
            yGlobal[tid + numODE * i] = yLocal[i];
        }
    }
}

The function integrator should be replaced with rkckDriver or rkcDriver (given below) depending on the desired integration algorithm.

3.1 RKCK Code

In the following, the source code for the RKCK driver device function is given in functional snippets. First, the minimum and maximum allowable time step sizes are defined, and the initial step size is set as half the integration time.

__device__ void
rkckDriver (double t, const double tEnd, const double g,
            double* y) {

    // maximum and minimum allowable step sizes
    const double hMax = fabs(tEnd - t);
    const double hMin = 1.0e-20;

    // initial step size: half the integration time
    double h = 0.5 * fabs(tEnd - t);

Then, inside the time integration loop, the algorithm takes a trial integration step and estimates the error of that step.

    // integrate until specified end time
    while (t < tEnd) {

        // limit step size based on remaining time
        h = fmin(tEnd - t, h);

        // y and error vectors, temporary until error determined
        double yTemp[NEQN], yErr[NEQN];

        // evaluate derivative
        double F[NEQN];
        dydt (t, y, g, F);

        // take a trial step
        rkckStep (t, y, g, F, h, yTemp, yErr);

        // calculate error
        double err = 0.0;
        int nanFlag = 0;
        for (int i = 0; i < NEQN; ++i) {
            if (isnan(yErr[i])) nanFlag = 1;

            err = fmax(err, fabs(yErr[i] / (fabs(y[i]) + fabs(h * F[i]) + TINY)));
        }
        err /= eps;

If the error is too large, the step size is decreased and the step retaken; otherwise, the algorithm accepts the step and calculates the next step size, then repeats the process.

        // check if error too large
        if ((err > 1.0) || isnan(err) || (nanFlag == 1)) {
            // step failed, error too large
            if (isnan(err) || (nanFlag == 1)) {
                h *= P1;
            } else {
                h = fmax(SAFETY * h * pow(err, PSHRNK), P1 * h);
            }

        } else {
            // step accepted
            t += h;

            if (err > ERRCON) {
                h = SAFETY * h * pow(err, PGROW);
            } else {
                h *= 5.0;
            }

            // ensure step size is bounded
            h = fmax(hMin, fmin(hMax, h));

            for (int i = 0; i < NEQN; ++i)
                y[i] = yTemp[i];
        }
    }
}

The device function dydt evaluates the derivative function F for the particular problem as in Eq. (8.1) using the input time t, vector of dependent variables y, and constant parameter(s) g. The device function rkckStep, not given here, takes a single integration step using Eqs. (8.2)–(8.9), returning the vector of integrated values yTemp as well as the error vector yErr; a possible sketch is given after the constants below. A number of constants are used in this function, given here:

#define UROUND (2.22e-16)
#define SAFETY 0.9
#define PGROW (-0.2)
#define PSHRNK (-0.25)
#define ERRCON (1.89e-4)
#define TINY (1.0e-30)
// P1 = 0.1: step size reduction clamp, consistent with the factor-of-10
// decrease limit noted in Sect. 8.2.1 (value assumed here)
#define P1 0.1
const double eps = 1.0e-10;
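The chapter does not list rkckStep itself; the following is a minimal sketch consistent with Eqs. (8.2)–(8.10), using the standard Cash–Karp coefficients tabulated by Press et al. [26] (Table 8.1). Here the factor \(\delta t_{n}\) (h) is folded into the stage combinations rather than into the \(\mathbf{k}_{i}\) themselves, and stages with zero weights (\(c_{2} = c_{5} = 0\)) are omitted from the final combinations; the actual implementation may differ in detail.

__device__ void
rkckStep (const double t, const double* y, const double g, const double* F,
          const double h, double* yTemp, double* yErr) {

    double ytmp[NEQN], k2[NEQN], k3[NEQN], k4[NEQN], k5[NEQN], k6[NEQN];

    // stage 2: k2 = f(t + a2 h, y + b21 h k1), with k1 = F
    for (int i = 0; i < NEQN; ++i)
        ytmp[i] = y[i] + h * (0.2 * F[i]);
    dydt (t + 0.2 * h, ytmp, g, k2);

    // stage 3
    for (int i = 0; i < NEQN; ++i)
        ytmp[i] = y[i] + h * ((3.0 / 40.0) * F[i] + (9.0 / 40.0) * k2[i]);
    dydt (t + 0.3 * h, ytmp, g, k3);

    // stage 4
    for (int i = 0; i < NEQN; ++i)
        ytmp[i] = y[i] + h * (0.3 * F[i] - 0.9 * k2[i] + 1.2 * k3[i]);
    dydt (t + 0.6 * h, ytmp, g, k4);

    // stage 5
    for (int i = 0; i < NEQN; ++i)
        ytmp[i] = y[i] + h * ((-11.0 / 54.0) * F[i] + 2.5 * k2[i]
                - (70.0 / 27.0) * k3[i] + (35.0 / 27.0) * k4[i]);
    dydt (t + h, ytmp, g, k5);

    // stage 6
    for (int i = 0; i < NEQN; ++i)
        ytmp[i] = y[i] + h * ((1631.0 / 55296.0) * F[i] + (175.0 / 512.0) * k2[i]
                + (575.0 / 13824.0) * k3[i] + (44275.0 / 110592.0) * k4[i]
                + (253.0 / 4096.0) * k5[i]);
    dydt (t + 0.875 * h, ytmp, g, k6);

    // fifth-order solution, Eq. (8.8), and embedded error estimate, Eq. (8.10)
    for (int i = 0; i < NEQN; ++i) {
        yTemp[i] = y[i] + h * ((37.0 / 378.0) * F[i] + (250.0 / 621.0) * k3[i]
                 + (125.0 / 594.0) * k4[i] + (512.0 / 1771.0) * k6[i]);
        yErr[i] = h * ((37.0 / 378.0 - 2825.0 / 27648.0) * F[i]
                + (250.0 / 621.0 - 18575.0 / 48384.0) * k3[i]
                + (125.0 / 594.0 - 13525.0 / 55296.0) * k4[i]
                + (-277.0 / 14336.0) * k5[i]
                + (512.0 / 1771.0 - 0.25) * k6[i]);
    }
}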

3.2 RKC Code

The RKC driver algorithm is next given. For this algorithm, the number of stages must be determined at each step to handle local stiffness; to avoid excess computation, a maximum number of stages is first set. In addition, minimum and maximum allowable time step sizes are defined.

__device__ void
rkcDriver (double t, const double tEnd, const double g, double* y) {

    // number of steps
    int numStep = 0;

    // maximum allowable number of RKC stages
    int mMax = (int)(round(sqrt(relTol / (10.0 * UROUND))));

    // RKC needs at least two stages for second-order accuracy
    if (mMax < 2) mMax = 2;

    // maximum allowable step size
    const double stepSizeMax = fabs(tEnd - t);

    // minimum allowable step size
    double stepSizeMin = 10.0 * UROUND * fmax(fabs(t), stepSizeMax);

Then, the algorithm evaluates the derivative using the initial conditions for use as the initial eigenvector estimate for the spectral radius calculation. The calculated eigenvectors are stored and used as initial guesses in later steps.

    // internal y vector
    double y_n[NEQN];
    for (int i = 0; i < NEQN; ++i)
        y_n[i] = y[i];

    // calculate F_n for initial y
    double F_n[NEQN];
    dydt (t, y_n, g, F_n);

    // internal work vector, zero-initialized so that work[2] < UROUND
    // correctly flags the first step below
    double work[4 + NEQN] = {0.0};

    // load initial estimate for eigenvector
    if (work[2] < UROUND) {
        for (int i = 0; i < NEQN; ++i) {
            work[4 + i] = F_n[i];
        }
    }

Inside the time integration loop, the algorithm calculates the spectral radius for the first step—which it next uses to determine the initial step size—and every 25 steps thereafter.

    // perform internal sub-stepping
    while (t < tEnd) {
        double tempArr[NEQN], tempArr2[NEQN], err;

        // if last step, limit step size
        if ((1.1 * work[2]) >= fabs(tEnd - t)) work[2] = fabs(tEnd - t);

        // estimate Jacobian spectral radius every 25 steps
        if ((numStep % 25) == 0) {
            work[3] = rkcSpecRad (t, y_n, g, F_n, stepSizeMax, &work[4], tempArr2);
        }

For the initial step, a trial step is taken using the inverse of the spectral radius as the step size; the resulting error is used to determine an appropriate step size that satisfies error control.

        // if this is the initial step
        if (work[2] < UROUND) {
            // estimate first time step
            work[2] = stepSizeMax;

            if ((work[3] * work[2]) > 1.0) work[2] = 1.0 / work[3];
            work[2] = fmax(work[2], stepSizeMin);

            for (int i = 0; i < NEQN; ++i) {
                tempArr[i] = y_n[i] + (work[2] * F_n[i]);
            }
            dydt (t + work[2], tempArr, g, tempArr2);

            err = 0.0;
            for (int i = 0; i < NEQN; ++i) {
                double est = (tempArr2[i] - F_n[i]) / (absTol + relTol * fabs(y_n[i]));
                err += est * est;
            }
            err = work[2] * sqrt(err / NEQN);

            if ((P1 * work[2]) < (stepSizeMax * sqrt(err))) {
                work[2] = fmax(P1 * work[2] / sqrt(err), stepSizeMin);
            } else {
                work[2] = stepSizeMax;
            }
        }

For all steps following the first, the value stored in work[2] is used for the time step size.

Next, the number of stages is determined using the spectral radius and current time step size, and a tentative integration step is performed.

        // calculate number of stages, Eq. (8.33)
        int m = 1 + (int)(sqrt(1.54 * work[2] * work[3] + 1.0));

        // modify step size if the stage limit is exceeded
        if (m > mMax) {
            m = mMax;
            work[2] = ((double)(m * m - 1)) / (1.54 * work[3]);
        }

        // perform tentative time step
        rkcStep (t, y_n, g, F_n, work[2], m, y);

The algorithm then estimates the error of that step.

        // calculate derivative F_np1 with tentative y_np1
        dydt (t + work[2], y, g, tempArr);

        // estimate error, Eq. (8.24)
        err = 0.0;
        for (int i = 0; i < NEQN; ++i) {
            double est = 0.8 * (y_n[i] - y[i]) + 0.4 * work[2] * (F_n[i] + tempArr[i]);
            est /= (absTol + relTol * fmax(fabs(y[i]), fabs(y_n[i])));
            err += est * est;
        }
        err = sqrt(err / ((double)NEQN));

Based on the error magnitude, the algorithm determines whether to accept the step and proceed to the next step or to decrease the step size and repeat the current step.

        // check value of error
        if (err > 1.0) {
            // error too large, step is rejected; select smaller step size
            work[2] = 0.8 * work[2] / pow(err, (1.0 / 3.0));

            // reevaluate spectral radius
            work[3] = rkcSpecRad (t, y_n, g, F_n, stepSizeMax, &work[4], tempArr2);
        } else {
            // step accepted
            t += work[2];
            numStep++;

Finally, for an accepted step, the current step size and error are stored and the next step size is calculated.

            double fac = 10.0;
            double temp1, temp2;

            if (work[1] < UROUND) {
                // first accepted step: no error/step history available yet
                temp2 = pow(err, (1.0 / 3.0));
                if (0.8 < (fac * temp2)) fac = 0.8 / temp2;
            } else {
                temp1 = 0.8 * work[2] * pow(work[0], (1.0 / 3.0));
                temp2 = work[1] * pow(err, (2.0 / 3.0));
                if (temp1 < (fac * temp2)) fac = temp1 / temp2;
            }

            // set "old" values to those for current time step
            work[0] = err;
            work[1] = work[2];

            for (int i = 0; i < NEQN; ++i) {
                y_n[i] = y[i];
                F_n[i] = tempArr[i];
            }

            work[2] *= fmax(P1, fac);
            work[2] = fmax(stepSizeMin, fmin(stepSizeMax, work[2]));
        }
    }
}

As before, we do not provide the RKC integration step device function rkcStep, which evaluates Eqs. (8.13)–(8.16), although a possible sketch is given after the tolerance definitions below. The absolute and relative tolerances absTol and relTol are set as defined constants, e.g.,

const double absTol = 1.0e-10;
const double relTol = 1.0e-6;
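The following sketch of rkcStep is consistent with Eqs. (8.13)–(8.20). Two choices in it are ours: the internal time levels \(c_{j}\) are evaluated in the exact form \(c_{j} = \omega_{1}T_{j}^{{\prime\prime}}(\omega_{0})/T_{j}^{{\prime}}(\omega_{0})\) rather than the approximations of Eqs. (8.21)–(8.22), and the first-stage coefficient is taken as \(b_{1} = b_{2} = 1/(4\omega_{0}^{2})\), following the rkc.f implementation of Sommeijer et al. [30], which keeps the time level of \(\mathbf{w}_{1}\) consistent with Eq. (8.21).

__device__ void
rkcStep (const double t, const double* y_n, const double g, const double* F_n,
         const double h, const int s, double* y) {

    // w0 per Eq. (8.19), with damping parameter kappa = 2/13 [30, 39]
    const double w0 = 1.0 + (2.0 / 13.0) / ((double)(s * s));

    // T_s'(w0) and T_s''(w0) via the recurrences of Eq. (8.20), giving w1
    double Tm2 = 1.0, Tm1 = w0, Tpm2 = 0.0, Tpm1 = 1.0, Tppm2 = 0.0, Tppm1 = 0.0;
    for (int j = 2; j <= s; ++j) {
        double T = 2.0 * w0 * Tm1 - Tm2;
        double Tp = 2.0 * Tm1 + 2.0 * w0 * Tpm1 - Tpm2;
        double Tpp = 4.0 * Tpm1 + 2.0 * w0 * Tppm1 - Tppm2;
        Tm2 = Tm1;     Tm1 = T;
        Tpm2 = Tpm1;   Tpm1 = Tp;
        Tppm2 = Tppm1; Tppm1 = Tpp;
    }
    const double w1 = Tpm1 / Tppm1;

    // stage 1, Eq. (8.14); b1 = b2 = 1/(4 w0^2) (assumption, see text above)
    double bjm2 = 0.25 / (w0 * w0);   // plays b_0 (= b_2)
    double bjm1 = bjm2;               // b_1
    const double mu1t = bjm1 * w1;    // mu-tilde_1
    double wjm2[NEQN], wjm1[NEQN], fjm1[NEQN];
    for (int i = 0; i < NEQN; ++i) {
        wjm2[i] = y_n[i];                       // w_0
        wjm1[i] = y_n[i] + mu1t * h * F_n[i];   // w_1
    }
    double cjm1 = mu1t;   // internal time level of w_1

    // Chebyshev values at w0, advanced alongside the stages
    double Tjm2 = 1.0, Tjm1 = w0;
    double Tpjm2 = 0.0, Tpjm1 = 1.0;
    double Tppjm2 = 0.0, Tppjm1 = 0.0;

    for (int j = 2; j <= s; ++j) {
        // degree-j Chebyshev polynomial values
        double Tj = 2.0 * w0 * Tjm1 - Tjm2;
        double Tpj = 2.0 * Tjm1 + 2.0 * w0 * Tpjm1 - Tpjm2;
        double Tppj = 4.0 * Tpjm1 + 2.0 * w0 * Tppjm1 - Tppjm2;

        // per-stage coefficients, Eqs. (8.17)-(8.19)
        double bj = Tppj / (Tpj * Tpj);
        double ajm1 = 1.0 - bjm1 * Tjm1;
        double mu = 2.0 * bj * w0 / bjm1;
        double nu = -bj / bjm2;
        double mut = 2.0 * bj * w1 / bjm1;
        double gt = -ajm1 * mut;          // gamma-tilde_j
        double cj = w1 * Tppj / Tpj;      // exact form of Eq. (8.22); c_s = 1

        // f_{j-1} = f(t + c_{j-1} dt, w_{j-1}), then w_j via Eq. (8.15)
        dydt (t + cjm1 * h, wjm1, g, fjm1);
        for (int i = 0; i < NEQN; ++i) {
            double wj = (1.0 - mu - nu) * y_n[i] + mu * wjm1[i] + nu * wjm2[i]
                      + mut * h * fjm1[i] + gt * h * F_n[i];
            wjm2[i] = wjm1[i];
            wjm1[i] = wj;
        }

        // shift histories for the next stage
        Tjm2 = Tjm1;     Tjm1 = Tj;
        Tpjm2 = Tpjm1;   Tpjm1 = Tpj;
        Tppjm2 = Tppjm1; Tppjm1 = Tppj;
        bjm2 = bjm1;     bjm1 = bj;
        cjm1 = cj;
    }

    // y_{n+1} = w_s, Eq. (8.16)
    for (int i = 0; i < NEQN; ++i)
        y[i] = wjm1[i];
}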

The tolerances absTol and relTol may be modified to more stringent values if desired. The constant UROUND is defined the same as in the RKCK code above. The local work array work contains, in element order:

  • work[0]: the error estimate of the previous step,

  • work[1]: the previous time step size,

  • work[2]: the current time step size,

  • work[3]: the spectral radius, and

  • work[4] onward: the eigenvector estimate used by the spectral radius routine (a vector of size N).

The device function rkcSpecRad returns the spectral radius, i.e., the largest-magnitude eigenvalue of the Jacobian matrix; various methods may be used for this purpose depending on the case. In the Appendix, we provide GPU source code for a nonlinear power method, adopted from Sommeijer et al. [30], that may be used for general applications.
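The condensed sketch below conveys the idea of the nonlinear power method. It is adapted from the rkcrho routine of Sommeijer et al. [30]: the iteration cap, the 1 % convergence test, and the 1.2 safety factor follow that code, while several of its safeguards (e.g., special handling of identically zero vectors) are simplified here, and the hMax argument of the full version is unused.

__device__ double
rkcSpecRad (const double t, const double* y, const double g, const double* F,
            const double hMax, double* v, double* Fv) {

    const int itmax = 50;   // maximum number of power iterations

    // norms of y and of the stored eigenvector estimate v
    double ynrm = 0.0, vnrm = 0.0;
    for (int i = 0; i < NEQN; ++i) {
        ynrm += y[i] * y[i];
        vnrm += v[i] * v[i];
    }
    ynrm = sqrt(ynrm);
    vnrm = sqrt(vnrm);

    // perturb y by a small amount along v
    double dynrm = (ynrm > 0.0) ? (ynrm * sqrt(UROUND)) : UROUND;
    if (vnrm > 0.0) {
        for (int i = 0; i < NEQN; ++i)
            v[i] = y[i] + v[i] * (dynrm / vnrm);
    } else {
        for (int i = 0; i < NEQN; ++i)
            v[i] = y[i] + dynrm;   // fallback direction
    }

    // nonlinear power iteration: sigma ~ ||f(y + dv) - f(y)|| / ||dv||
    double sigma = 0.0;
    for (int iter = 0; iter < itmax; ++iter) {
        dydt (t, v, g, Fv);

        double dfnrm = 0.0;
        for (int i = 0; i < NEQN; ++i) {
            double d = Fv[i] - F[i];
            dfnrm += d * d;
        }
        dfnrm = sqrt(dfnrm);

        double sigmaOld = sigma;
        sigma = dfnrm / dynrm;

        // converged when the estimate changes by less than 1 %
        if ((iter >= 1) && (fabs(sigma - sigmaOld) <= 0.01 * fabs(sigma))) break;
        if (dfnrm == 0.0) break;   // degenerate direction; give up

        // renormalize the perturbation along the new direction
        for (int i = 0; i < NEQN; ++i)
            v[i] = y[i] + (Fv[i] - F[i]) * (dynrm / dfnrm);
    }

    // store the direction for the next call; return with a safety factor
    for (int i = 0; i < NEQN; ++i)
        v[i] -= y[i];
    return 1.2 * sigma;
}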

4 Performance Results

We tested the performance of the GPU-based RKCK and RKC integration algorithms using two ODE test cases, varying the number of ODE systems from \(10^{1}\) to \(10^{5}\). For both cases, all calculations were performed in double precision using a single GPU and a single CPU; we compared the performance of the GPU algorithm against serial CPU calculations as well as parallelized CPU performance—via OpenMP [24]—on four cores. We performed the GPU calculations using an NVIDIA Tesla C2075 GPU with 6 GB of global memory; an Intel Xeon X5650 CPU, running at 2.67 GHz with 256 kB of L2 cache memory per core and 12 MB of L3 cache memory, served as the host processor both for the GPU calculations and for the single- and four-core OpenMP calculations. We used the GNU Compiler Collection (gcc) version 4.6.2 (with the compiler options “-O3 -ffast-math -std=c99 -m64”) to compile the CPU programs and the CUDA 5.5 compiler nvcc version 5.5.0 (“-O3 -arch=sm_20 -m64”) to compile the GPU versions. We set the GPU to persistence mode and also called cudaSetDevice() prior to the timing to hide any further device initialization delay in the CUDA implementations.

The integration algorithms take as input initial conditions and a global time step, performing internal sub-stepping as necessary. The computational wall-clock times reported represent the average over ten global time steps, which for the GPU versions includes the overhead required for transmitting data between the CPU and GPU before and after each global step. The integrator restarts at each global time step, not storing any data from the previous step—although any sub-stepping performed by the algorithm within these larger steps does benefit from retained information from prior sub-steps. Interested readers should refer to Niemeyer and Sung [21] for more detailed performance evaluations of these algorithms in the context of chemical kinetics problems.
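For instance, the timing measurement can be as simple as wrapping the host-side integration loop of Sect. 8.3 with a wall-clock timer; a sketch (the POSIX clock_gettime timer is our choice here, not necessarily that of the original study):

#include <time.h>

struct timespec tic, toc;
clock_gettime (CLOCK_MONOTONIC, &tic);

for (int step = 0; step < 10; ++step) {
    // host-to-device copy, intDriver kernel launch, device-to-host copy,
    // exactly as in the loop of Sect. 8.3; the blocking device-to-host
    // cudaMemcpy synchronizes each global step
}

clock_gettime (CLOCK_MONOTONIC, &toc);
double timePerStep = ((toc.tv_sec - tic.tv_sec)
                   + 1.0e-9 * (toc.tv_nsec - tic.tv_nsec)) / 10.0;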

4.1 RKCK Results

We used the Pleiades ODE test problem (PLEI) of Hairer et al. [9, 20] to test the GPU- and CPU-based versions of the RKCK integrator. This nonstiff test case originates from a celestial mechanics problem tracking the coordinates of seven stars; it consists of a set of 14 second-order ODEs based on Newtonian gravitational forces, in the form

$$\displaystyle\begin{array}{rcl} \mathbf{z}^{{\prime\prime}} = \binom{\mathbf{x}}{\mathbf{y}}^{{\prime\prime}} = \binom{\mathbf{f}^{(1)}\left (\mathbf{x},\mathbf{y}\right )}{\mathbf{f}^{(2)}\left (\mathbf{x},\mathbf{y}\right )},\quad \mathbf{z} \in \mathfrak{R}^{14}\;,& &{}\end{array}$$
(8.34)
$$\displaystyle\begin{array}{rcl} x_{i}^{{\prime\prime}} = f_{ i}^{(1)}\left (\mathbf{x},\mathbf{y}\right ) =\sum _{ j\neq i}m_{j}\left (x_{j} - x_{i}\right )/r_{ij}\;,& &{}\end{array}$$
(8.35)
$$\displaystyle\begin{array}{rcl} y_{i}^{{\prime\prime}} = f_{ i}^{(2)}\left (\mathbf{x},\mathbf{y}\right ) =\sum _{ j\neq i}m_{j}\left (y_{j} - y_{i}\right )/r_{ij}\;,& &{}\end{array}$$
(8.36)
$$\displaystyle\begin{array}{rcl} r_{ij} = \left (\left (x_{i} - x_{j}\right )^{2} + \left (y_{ i} - y_{j}\right )^{2}\right )^{3/2},\quad i,j = 1,\mathop{\ldots },7\;,& &{}\end{array}$$
(8.37)

where \((x_{i}, y_{i})\) and \(m_{i} = i\) are the coordinates and mass of the ith star, respectively. This second-order system can be converted into a system of 28 first-order ODEs of the form of Eq. (8.1) by defining \(\mathbf{w} =\mathbf{ z}^{{\prime}}\), such that

$$\displaystyle{ \binom{\mathbf{z}}{\mathbf{w}}^{{\prime}} = \binom{\mathbf{w}}{\mathbf{f}(\mathbf{z})},\quad \binom{\mathbf{z}}{\mathbf{w}} \in \mathfrak{R}^{28}\;. }$$
(8.38)

While the original problem offers specific initial conditions for a single ODE system, here we consider a large number of ODEs with the initial conditions randomly perturbed by a small factor to emulate a sensitivity analysis. We integrated the ODE systems from t = 0 to 1.0 s using a global time step size of \(1.0 \times 10^{-1}\) s. We set the RKCK tolerance \(\varepsilon\) (eps in the code) to \(1.0 \times 10^{-10}\).
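As a concrete example of the problem-specific derivative function used throughout Sect. 8.3, a dydt for this system might look as follows; this sketch is ours, assuming NEQN = 28, the state ordering \((\mathbf{x},\mathbf{y},\mathbf{x}^{{\prime}},\mathbf{y}^{{\prime}})\), and no use of the constant parameter g:

__device__ void
dydt (const double t, const double* y, const double g, double* F) {
    // y[0..6] = x_i, y[7..13] = y_i, y[14..20] = x_i', y[21..27] = y_i'
    for (int i = 0; i < 7; ++i) {
        // positions evolve with the velocities
        F[i] = y[14 + i];
        F[7 + i] = y[21 + i];

        // accelerations from Eqs. (8.35)-(8.37), with m_j = j (1-based)
        double sumX = 0.0, sumY = 0.0;
        for (int j = 0; j < 7; ++j) {
            if (j == i) continue;
            double dx = y[j] - y[i];
            double dy = y[7 + j] - y[7 + i];
            double rij = pow (dx * dx + dy * dy, 1.5);
            sumX += (double)(j + 1) * dx / rij;
            sumY += (double)(j + 1) * dy / rij;
        }
        F[14 + i] = sumX;
        F[21 + i] = sumY;
    }
}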

Fig. 8.1 Speedup factors offered by the GPU-based explicit RKCK integration algorithm over single- and four-core CPU-based versions for the Pleiades ODE problem. Note that the horizontal axis is displayed in logarithmic scale

Figure 8.1 shows the speedup factors, measured as the ratio of computational times per step, offered by the GPU-based RKCK algorithm over the baseline CPU version for both serial and four-core parallel operation, for numbers of ODE systems ranging from 1,024 to 262,144. The GPU-based algorithm ran faster than the serial and parallel CPU-based algorithms for N ode larger than 4,096 and 8,192, respectively. For the current problem, at best the GPU offered speedup factors of nearly four and two over the serial and four-core CPU implementations, respectively. The non-smooth performance scaling resulted from the randomly perturbed initial conditions.

Note that since each ODE system used randomly perturbed initial conditions, adjacent threads in the GPU implementation handled potentially extremely different initial condition values, resulting in thread divergence due to varying internal time step sizes. Therefore, the results shown here represent a worst-case GPU algorithm performance, particularly compared to applications involving operator-split reactive-flow codes where adjacent threads/ODE systems correspond to neighboring spatial locations. In such situations, initial conditions would be more similar and therefore follow similar instruction pathways. In either case, GPU-based integration algorithms offer performance benefits over the baseline CPU versions. See Niemeyer and Sung [21] for more discussion on this topic.

Furthermore, the current problem involved a relatively simple system of ODEs, limiting the calculations performed on the GPU between the memory transfers before and after each integration step. ODE systems with more complex derivative functions would saturate the GPU with operations, increasing performance. For example, the RKCK algorithm demonstrated by Niemeyer and Sung [21] performed up to 126 times faster on a GPU than on a serial CPU, integrating a chemical kinetics ODE system with nine species participating in 38 reaction steps—requiring significantly more floating-point calculations than the case studied here.

4.2 RKC Results

To demonstrate the performance of the GPU-based RKC algorithm, we used a chemical kinetics problem: the ODE system describing the constant-volume autoignition of ethanol (C2H5OH). We implemented the reaction mechanism of Marinov [19] to describe the oxidation of ethanol, with 57 species participating in 766 irreversible reaction steps. The governing ODE system contained 58 equations: one for temperature T and the rest for species mass fractions \(\mathbf{Y }\):

$$\displaystyle\begin{array}{rcl} \frac{d\mathbf{y}} {dt} = \left (\frac{dT} {dt}, \frac{dY _{1}} {dt},\mathop{\ldots }, \frac{dY _{N_{\text{sp}}}} {dt} \right )^{\intercal }\;,& &{}\end{array}$$
(8.39)
$$\displaystyle\begin{array}{rcl} \frac{dT} {dt} = -\frac{1} {\rho c_{v}}\sum _{i=1}^{N_{\text{sp}}}e_{ i}\omega _{i}W_{i}\;,& &{}\end{array}$$
(8.40)
$$\displaystyle\begin{array}{rcl} \frac{dY _{i}} {dt} = \frac{W_{i}\omega _{i}} {\rho },\quad i = 1,\mathop{\ldots },N_{\text{sp}}\;,& &{}\end{array}$$
(8.41)
$$\displaystyle\begin{array}{rcl} \omega _{i} =\sum _{ j=1}^{N_{\text{reac}}}\left (\nu _{ ij}^{{\prime\prime}}-\nu _{ ij}^{{\prime}}\right )k_{ j}\prod _{k=1}^{N_{\text{sp}}}C_{ k}^{\nu _{kj}^{{\prime}} }\;,& &{}\end{array}$$
(8.42)

where ρ indicates the density, \(c_{v}\) the mass-averaged constant-volume specific heat, \(e_{i}\) the internal energy of the ith species, \(W_{i}\) the molecular weight of the ith species, \(\nu_{ij}^{{\prime\prime}}\) and \(\nu_{ij}^{{\prime}}\) the stoichiometric coefficients of the ith species in reaction j as product and reactant, respectively, \(C_{k}\) the molar concentration of the kth species, and \(N_{\text{sp}}\) and \(N_{\text{reac}}\) the numbers of species and reactions, respectively. For a reaction j without pressure dependence, the rate coefficient \(k_{j}\) is given in Arrhenius form by

$$\displaystyle{ k_{j} = A_{j}T^{\beta _{j}}\exp \left (\frac{-E_{j}} {\mathcal{R}T} \right )\;, }$$
(8.43)

where \(\mathcal{R}\) is the universal gas constant, A j the pre-exponential coefficient, β j the temperature exponent, and E j the activation energy. Note that reactions can be pressure-dependent (see, e.g., Law [17] for examples of various pressure-dependence formulations); these were also considered in the current implementation.
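For a reaction without pressure dependence, Eq. (8.43) maps directly to code; a one-line sketch, with hypothetical coefficient arrays A, beta, and E, and the gas constant RU in units consistent with E:

#define RU 8314.4621   // universal gas constant, J/(kmol K) — an assumed unit choice

// forward rate coefficient of reaction j at temperature T, Eq. (8.43)
double kf = A[j] * pow (T, beta[j]) * exp (-E[j] / (RU * T));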

This problem is moderately stiff at the global time step size used here, \(\delta t = 1.0 \times 10^{-6}\) s, over ten global time steps. In this case, we generated initial conditions for the set of ODE systems by sampling the solutions obtained from constant-pressure homogeneous ignition simulations, initiated at 1,600 K, 1 atm, and an equivalence ratio of one. We assigned these initial conditions sequentially, such that adjacent threads in the GPU implementation contained data from consecutive time steps in the sample—and therefore such threads handled the integration of similar conditions, emulating adjacent spatial locations in an operator-split reactive-flow simulation.

Fig. 8.2 Speedup factors offered by the GPU-based explicit RKC integration algorithm over single- and four-core CPU-based versions for the chemical kinetics ODE problem. Note that both axes are displayed in logarithmic scale

Figure 8.2 shows the speedup factors offered by the GPU-based RKC algorithm over the baseline CPU version for both serial and four-core parallel operation, for numbers of ODE systems ranging from 64 to 16,384. In this case, the GPU-accelerated code ran faster than the serial CPU version for the entire range of ODE system sizes considered, while it offered better performance than the four-core parallel CPU version for 256 ODEs and higher. At best, the GPU-based RKC algorithm ran nearly 64 and 17 times faster than the serial and four-core CPU implementations, respectively. The discontinuity in speedup seen in Fig. 8.2 corresponded to the inclusion of initial conditions leading to greater stiffness.

5 Conclusions

In this chapter, we presented two explicit algorithms appropriate for integrating large numbers of independent ODE systems on GPUs: the fifth-order adaptive Runge–Kutta–Cash–Karp (RKCK) method for nonstiff problems and the stabilized second-order adaptive Runge–Kutta–Chebyshev (RKC) method for problems with moderate levels of stiffness. Source code and implementation details were presented to ease the adoption of these methods, along with performance comparisons for each. The examples shown here demonstrate the potential of GPU acceleration where many independent systems of ODEs must be integrated; in the case of the RKC algorithm, we demonstrated more than an order of magnitude performance increase over an equivalent parallel CPU code running on four cores. The types of scientific and engineering problems dealing with large numbers of ODEs—in particular, reactive-flow models that rely on operator splitting—can benefit significantly from GPU acceleration; interested readers can directly implement the algorithms presented here to such ends, or use them as the basis for their own solutions.